Patches and Documents related to Page Fault Performance in the Linux Kernel

Table of Contents

Patchsets

  1. Combined Patch (prezeroing, prefaulting and atomic page table entry operations)
  2. Atomic Page Table Operations
  3. Pre-zeroing
  4. Anticipatory Preallocation of Pages
  5. Hierarchical Backoff Locks
  6. Page Faults for Huge Pages

Descriptions

  1. Page fault handler
  2. Counter issues
  3. Atomic Page Table Operations
  4. Pre-zeroing
  5. Anticipatory Preallocation of Pages
  6. Bibliography

Performance Benchmarks

  1. Summary
  2. Kernel Compilation
  3. Microbenchmark
  4. lmbench

The Page Fault Handler

The page fault handler is a critical piece of code in the Linux kernel that has a major influence on the performance of the memory subsystem. A process has a memory map (page table) assigned to it, containing page table entries that point to the memory assigned to the process and describe to the CPU which memory locations the process may validly access. Memory is managed and assigned to processes in chunks called pages. Initially a process has no pages assigned at all and the page table is empty. Any access to a memory area not mapped by the page table results in the generation of a page fault by the CPU. The operating system must provide a page fault handler that deals with the situation and determines how the process may continue. On startup a process must set up its own memory areas and therefore page faults occur most frequently when a new process is started.

In order to allow access to pages of memory never accessed before, the page fault handler must reserve a free page of memory and then make that page visible to the process by adding a page table entry to the page table. This path is critical: page fault performance determines how quickly a process can acquire memory and is of particular importance for applications that utilize large amounts of memory. As systems are equipped with larger and larger amounts of memory, more and more page faults must be generated and handled by Linux, and the scalability of the page fault handler becomes an increasingly important issue.

The Linux page fault handler relies on acquiring a read-write semaphore (mmap_sem) and a spin lock, the page_table_lock, for synchronization between multiple threads of a task. A page fault first acquires a read lock on mmap_sem (which alone would allow other threads to continue processing page faults) and then acquires the page_table_lock spin lock (which serializes access to the page table and important data structures) before acquiring a free page from the page allocator. The page is then cleared by overwriting its contents with zeros (only initialized memory is provided to processes!) and the page is assigned to the process by creating a corresponding page table entry in the page table of the process. The page fault handler is a very hot code path that is sensitive to minor code changes and depends heavily on the organization of data structures. Cache line bouncing has a critical influence on page fault performance in SMP systems and becomes particularly significant for large applications (like huge databases or computational applications) that try to minimize startup time by having multiple threads of a process running on different processors in order to initialize their memory structures concurrently.
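
The following is a minimal sketch of the sequence just described, loosely following the 2.6-era anonymous fault path. down_read(), spin_lock(), alloc_page() and clear_user_highpage() are real kernel primitives; install_anon_page() is a placeholder for the pte setup done by the real handler, and the real code takes mmap_sem in the architecture's fault entry code and interleaves the lock and the allocation somewhat differently.

#include <linux/mm.h>
#include <linux/highmem.h>

/* Sketch of the locking sequence in the anonymous page fault path.
 * install_anon_page() is a placeholder; rmap and LRU bookkeeping as
 * well as most error handling are omitted. */
static int sketch_anonymous_fault(struct mm_struct *mm,
                                  struct vm_area_struct *vma,
                                  unsigned long address)
{
        struct page *page;

        down_read(&mm->mmap_sem);          /* read lock: other threads may fault too */
        spin_lock(&mm->page_table_lock);   /* but page table updates are serialized  */

        page = alloc_page(GFP_HIGHUSER);   /* acquire a free page                    */
        if (!page) {
                spin_unlock(&mm->page_table_lock);
                up_read(&mm->mmap_sem);
                return VM_FAULT_OOM;
        }
        clear_user_highpage(page, address);        /* "hot" zeroing of the page      */
        install_anon_page(mm, vma, address, page); /* create the page table entry    */
        mm->rss++;                         /* counter protected by page_table_lock   */

        spin_unlock(&mm->page_table_lock);
        up_read(&mm->mmap_sem);
        return VM_FAULT_MINOR;
}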

Performance measurements show that current (2.6.10) code in the Linux page fault handler only scales well up to 2 or 4 CPUs. Performance is negatively impacted by larger CPU counts and becomes worse than a single thread for 16 CPUs. Performance may drop to a fraction of single thread performance for SMP systems with more than 64 CPUs.

Three means of optimizing page fault performance are covered here. First, one may avoid the use of the page table lock through atomic operations on page table entries. Multiple faults may then occur concurrently on multiple CPUs in an SMP system, although additional measures must be taken to avoid cache line bouncing. Second, the page fault handler may analyze the access pattern of the process. Optimizing for sequential memory allocation is then possible by anticipating future accesses: multiple pages can be preallocated and multiple page table entries may be generated in a single page fault. The locking overhead is reduced since the fault handler is invoked less frequently, and therefore SMP performance improves. Finally, if zeroed pages are available then the page fault handler may simply assign a zeroed page to the process, avoiding the clearing of pages in the fault handler. This will reduce the number of times the page table lock has to be acquired.

Atomic Operations on the Page Table

In order to avoid the use of the page table lock for most of the page fault handler, atomic operations on page table entries have to be defined for each supported architecture, and page fault handling has to be rewritten to either integrate the atomic operations with the existing uses of the page table lock or drop the page table lock completely. Nick Piggin has developed a patch to completely do away with the page table lock. However, performance is better if one can still use the page table lock for larger operations on the page tables and instead restricts the atomic operations to critical operations on the page table. In the proposed solution atomic operations are only used to populate empty page table entries. Other operations still need to acquire the page table lock.

The changes that make atomic operations on the page table possible cannot be applied to all platforms, since on some platforms the page table entries are larger than the word size, which makes it impossible to update the entries atomically. A fallback mechanism was designed that uses the page table lock instead of atomic operations on such platforms.

Some more details: The performance increase is accomplished by avoiding the use of the page_table_lock spinlock (but not mm->mmap_sem) through new atomic operations on ptes (ptep_xchg, ptep_cmpxchg) and on pmds, puds and pgds (pgd_test_and_populate, pud_test_and_populate, pmd_test_and_populate).
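
As a rough illustration, an empty pte might be populated without the page_table_lock along the following lines. ptep_cmpxchg is the operation named above, but the exact signature used here is an assumption; rmap, LRU and counter updates are omitted, and architectures without a suitable atomic operation would fall back to taking the page table lock instead.

#include <linux/mm.h>
#include <linux/pagemap.h>

/* Sketch: install a pte into an empty slot without page_table_lock.
 * ptep_cmpxchg() is the per-architecture primitive added by the patch;
 * the signature shown here is assumed for illustration. */
static int sketch_populate_pte(struct vm_area_struct *vma,
                               unsigned long address,
                               pte_t *ptep, struct page *page)
{
        pte_t old = *ptep;
        pte_t new = mk_pte(page, vma->vm_page_prot);

        if (!pte_none(old))
                return 0;               /* already populated, nothing to do */

        if (!ptep_cmpxchg(vma, address, ptep, old, new)) {
                /* Another CPU installed a pte first; drop our page. */
                page_cache_release(page);
                return 0;
        }
        return 1;                       /* pte installed without the spinlock */
}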

The page table lock can be avoided in the following situations:

An empty pte or pmd entry is populated

This is safe since the swapper may only depopulate them and the swapper code has been changed to never set a pte to be empty until the page has been evicted. The population of an empty pte is frequent if a process touches newly allocated memory.

Modifications of flags in a pte entry (write/accessed)

These modifications are done by the CPU or by low level handlers on various platforms also bypassing the page_table_lock. So this seems to be safe too.

Locking issues for counters in mm_struct

One additional complication is that the page table lock is also used to serialize access to some counters (RSS) in mm_struct. To solve the RSS issue a number of approaches have been proposed: making RSS atomic, ignoring the errors generated by concurrent accesses, dynamically calculating RSS when it is needed, and finally splitting RSS into another structure that is task specific and may be used without locking.

Atomic Counters

Atomic counters have the disadvantage that they require synchronization and may lead to cacheline bouncing. Multiple atomic operations may offset the benefit obtained by not acquiring the page_table_lock.

Dynamically Calculated Counters

Counters in earlier Linux versions were calculated dynamically, which avoided the maintenance of counters in the VM paths. Access to the data then requires a scan through the page tables to determine the number of pages in use (RSS).
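
In essence a dynamic calculation amounts to counting the present entries while walking the page tables; the helper below is purely illustrative (only the innermost pte level is shown, a real walk would also iterate the pgd/pud/pmd levels):

#include <linux/mm.h>

/* Illustrative only: count present ptes in one pte page to derive RSS
 * on demand instead of maintaining mm->rss on every fault. */
static unsigned long sketch_count_ptes(pte_t *pte, int nr)
{
        unsigned long rss = 0;
        int i;

        for (i = 0; i < nr; i++)
                if (pte_present(pte[i]))
                        rss++;
        return rss;
}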

Split RSS into task_struct

A split counter may avoid the atomic operations and locks currently necessary for RSS modifications. In addition to mm->rss, tsk->rss is introduced. tsk->rss is placed in the same cache line as tsk->mm (which is already used by the fault handler) and thus tsk->rss can be incremented quickly and without locks. The cache line does not need to be shared between processors in the page fault handler.
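
A minimal sketch of the split-counter idea, assuming the tsk->rss field described above; the total is obtained by summing the per-thread contributions (thread-list locking is omitted, and tasks that share the mm without being in the same thread group are not handled here):

#include <linux/sched.h>

/* Fast path: only the current task writes its own counter, so no lock
 * or atomic operation is needed and no cache line bounces. */
static inline void sketch_inc_rss(struct task_struct *tsk)
{
        tsk->rss++;
}

/* Slow path: fold the per-thread counts into a total when RSS is
 * actually read (e.g. for /proc).  next_thread() walks the thread group. */
static unsigned long sketch_read_rss(struct task_struct *tsk)
{
        struct task_struct *t = tsk;
        unsigned long rss = tsk->mm->rss;   /* pages accounted to the mm itself */

        do {
                rss += t->rss;
                t = next_thread(t);
        } while (t != tsk);
        return rss;
}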

Anticipatory Preallocation

Anticipatory preallocation of multiple pages requires new per-task variables in order to track where the last page fault occurred. If page faults occur in sequence then more than one page is preallocated per fault. If the sequence continues then the number of pages preallocated is increased until the preset limit in /proc/sys/vm/max_prealloc_order is reached. A drawback of preallocation is that unnecessary page table entries may be generated. These will be removed by the swapper (if memory becomes scarce) or when the process terminates, but the preallocated pages will stay in memory until then. The advantage of preallocation is that performance may be improved for concurrent SMP page faults without extensive modifications to VM locking.

The preallocation may be controlled by setting /proc/sys/vm/max_prealloc_order. By default the order is set to 1. This means that only one page may be preallocated. For applications that allocate and use large amounts of memory a setting of 4 gives the most benefit. However, a larger setting may slightly decrease performance for other applications since unnecessary pages may be allocated. The performance decrease for kernel compilation is around 5% if max_prealloc_order is set to 4.
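
A rough sketch of such a heuristic follows. The per-task fields last_fault_addr and prealloc_order and the variable sysctl_max_prealloc_order (standing for /proc/sys/vm/max_prealloc_order) are illustrative names, not the patch's actual identifiers:

#include <linux/sched.h>
#include <linux/mm.h>

/* Sketch of the anticipatory preallocation heuristic: sequential faults
 * grow the preallocation window, a non-sequential fault resets it.  The
 * caller would then set up 1 << order pages in this single fault. */
static int sketch_prealloc_order(struct task_struct *tsk,
                                 unsigned long address)
{
        if (address == tsk->last_fault_addr + PAGE_SIZE) {
                /* Sequential access detected: preallocate more next time,
                 * up to the /proc/sys/vm/max_prealloc_order limit. */
                if (tsk->prealloc_order < sysctl_max_prealloc_order)
                        tsk->prealloc_order++;
        } else {
                /* Random access: fall back to a single page. */
                tsk->prealloc_order = 0;
        }
        tsk->last_fault_addr = address;
        return tsk->prealloc_order;
}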

Prezeroing

Prezeroing of pages depends on the availability of a pool of zeroed pages and therefore requires a modification of the buddy allocator so that it manages not only free pages but also zeroed pages. The effectiveness of prezeroing compared to hot zeroing in the page fault handler depends on the zeroing method used. The proposed method only zeroes higher order pages, because a higher order page can later be broken down into multiple order 0 pages for the page allocator, and it is more efficient to clear one huge area of memory than multiple smaller areas. The clearing of a page typically invalidates the CPU cache (if no zeroing hardware is available). It is therefore advantageous to reduce the number of times the zeroing function runs as much as possible, and when zeroing is started it is advantageous to zero as large a chunk of memory as possible. In the ideal case zeroing may be performed by a hardware device that bypasses the CPU caches, minimizing the impact of zeroing on the system. Such a mechanism is also presented.
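
A sketch of the scrub daemon side of this scheme, under the assumption that each zone keeps a separate list of zeroed blocks; the sketch_* helpers and the zeroed-list handling are illustrative names rather than the patch's actual code:

#include <linux/mm.h>
#include <linux/mmzone.h>
#include <linux/highmem.h>

/* Sketch: zero free memory in the largest possible chunks and park it
 * on a per-zone list of zeroed blocks that the allocator checks first. */
static void sketch_scrub_zone(struct zone *zone)
{
        int order;

        for (order = MAX_ORDER - 1; order >= 0; order--) {
                struct page *page = sketch_take_free_block(zone, order);
                int i;

                if (!page)
                        continue;       /* nothing free at this order */

                /* One pass over 2^order pages; a single large clear is
                 * cheaper than many separate order-0 clears. */
                for (i = 0; i < (1 << order); i++)
                        clear_highpage(page + i);

                /* Hand the block back as prezeroed memory. */
                sketch_add_zeroed_block(zone, page, order);
        }
}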

Highest Performance

The ultimate performance for the page fault handler is only reached through the combination of all three approaches. Prezeroing in general doubles or quadruples single fault performance. The atomic operations on the page table reduce the locking overhead, and moving variables into the task structure addresses the cacheline bouncing issues. Anticipatory prefaulting finally addresses the general overhead of the page fault handler. These approaches combined result in a page fault handler that shows an almost linear performance increase as CPUs are added, even in the high end range from 64 to 512 CPUs.

Performance Tests Summary

The page fault scalability patches are only useful for threaded applications on SMP systems and become more beneficial with higher processor counts.

Kernel Compilation

Kernel compilation is a nonthreaded load using some anonymous memory. Page fault scalability issues are mainly encountered with applications accessing large amounts of memory concurrently from multiple threads. There is no significant performance benefit to be expected. The tests listed here are mainly to show that there is no negative impact.

Altix 8 way SMP system, 6G RAM

Unmodified kernel
real    2m53.692s
user    18m55.574s
sys     0m51.603s

Prezeroing patch

scrub_load = 1. Prezeroing only before kernel compilation is started.
real    2m52.757s
user    18m54.735s
sys     0m50.832s
scrub_load = 99. Prezeroing during kernel compilation.
real    2m57.225s
user    19m1.399s
sys     0m42.298s

Prefault patch

max_prealloc_order = 4.
real    3m9.471s
user    18m48.175s
sys     0m55.082s
max_prealloc_order = 1.
real    2m53.537s
user    18m55.664s
sys     0m50.976s

Atomic PTE operations

real    2m53.104s
user    18m55.580s
sys     0m51.112s

Prezeroing on a single processor i386 system

The system has an AMD64 3200+ CPU with 1 GB of memory. Each test run first prints the memory layout before compilation to verify that memory is properly prezeroed for the test. No hardware zeroing device is available:

Performance with prezeroing switched off (scrub_start == 99) on Linux 2.6.11-rc4-bk2. Both samples are the best times after repeated compilation runs:

Mon Feb 14 11:03:07 PST 2005
Node 0, zone      DMA      2      4      4      4      4      0      0      0      0      0      0
        Zeroed Pages       0      0      0      0      0      1      1      1      1      1      2
Node 0, zone   Normal   2152   1826    691    793   1202    274    119     86     86     58     77
        Zeroed Pages       0      0      0      0      0      0      0      0      0      0      0
278.34user 26.58system 5:25.18elapsed 93%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+8053981minor)pagefaults 0swaps
Performance with prezeroing (scrub_start == 8, scrub_stop == 2):
Mon Feb 14 11:46:07 PST 2005
Node 0, zone      DMA      2      4      0      0      0      0      0      0      0      0      0
        Zeroed Pages       0      0      4      4      4      1      1      1      1      1      2
Node 0, zone   Normal   2910   2187   1763   1250      0      0      0      0      0      0      0
        Zeroed Pages       0      0      1    167    814    217    118     92     82     71     68
278.58user 24.26system 5:24.01elapsed 93%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+8053981minor)pagefaults 0swaps
System time for compilation is reduced somewhat, which leads to a 1 second win (~0.3%), but that win is within the margin of noise. No benefit should be expected from prezeroing for kernel builds.

Micro Benchmark Results

The microbenchmark used here was written specifically to test page fault performance. It repeatedly allocates memory and then touches one location on each of the allocated pages in order to fault them in, and it measures the number of faults that can be generated per second.
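
For illustration, a userspace approximation of such a benchmark might look as follows. This is a sketch of the idea, not the actual benchmark used for the numbers below: it maps anonymous memory, has each thread touch one word per page of its slice, and reports faults per wall-clock second.

/* Build: cc -O2 -pthread pft-sketch.c -o pft-sketch
 * Run:   ./pft-sketch <megabytes> <threads>                        */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <unistd.h>

/* Slice of the mapping handed to one thread. */
struct slice {
        char *base;
        size_t bytes;
};

static long page_size;

/* Touch one byte per page so that every page takes exactly one fault. */
static void *touch_pages(void *arg)
{
        struct slice *s = arg;
        size_t off;

        for (off = 0; off < s->bytes; off += page_size)
                s->base[off] = 1;
        return NULL;
}

int main(int argc, char **argv)
{
        size_t mb = argc > 1 ? (size_t)atol(argv[1]) : 256;
        int threads = argc > 2 ? atoi(argv[2]) : 1;
        size_t bytes = mb << 20;
        size_t per;
        struct timeval t0, t1;
        double wall, faults;
        char *mem;
        int i;

        page_size = sysconf(_SC_PAGESIZE);
        per = bytes / threads;

        mem = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (mem == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        pthread_t tid[threads];
        struct slice s[threads];

        gettimeofday(&t0, NULL);
        for (i = 0; i < threads; i++) {
                s[i].base = mem + (size_t)i * per;
                s[i].bytes = per;
                pthread_create(&tid[i], NULL, touch_pages, &s[i]);
        }
        for (i = 0; i < threads; i++)
                pthread_join(tid[i], NULL);
        gettimeofday(&t1, NULL);

        wall = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        faults = (double)(per / page_size) * threads;
        printf("%zu MB, %d threads: %.0f faults in %.3fs = %.0f faults/wsec\n",
               mb, threads, faults, wall, faults / wall);
        return 0;
}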

Altix 128 way SMP system, 500G RAM

The scalability problems are most severe at high CPU counts. Without any patches the peak page fault performance is reached with 4 CPUs. Atomic PTE operations allow the system to reach a higher peak at 16 CPUs. The same peak is also reachable with prefaulting. This means that beyond 16 CPUs the page_table_lock is no longer the bottleneck. Both patches must still increment the RSS variables, which leads to further cacheline bouncing. The split RSS patch that was part of prior releases of the atomic pte operations patch addresses that issue and would allow an even higher page fault rate. It was replaced with atomic variables in later releases in order to simplify the patches.

Baseline 2.6.11-rc4-bk2 without any patches

Allocating 1 GB
 Gb Rep Thr CLine  User      System   Wall  flt/cpu/s fault/wsec
  1  3    1   1    0.04s      1.99s   2.00s 96466.979  96485.259
  1  3    2   1    0.04s      2.81s   1.05s 68759.082 126692.387
  1  3    4   1    0.05s      4.65s   1.04s 41803.677 138548.237
  1  3    8   1    0.12s     12.44s   1.09s 15649.165  99233.376
  1  3   16   1    0.11s     31.04s   2.05s  6308.209  77879.592
  1  3   32   1    0.82s     98.80s   4.00s  1973.422  48432.408
  1  3   64   1    9.29s    293.51s   6.02s   649.266  31229.490
  1  3  128   1   50.20s    588.48s   7.08s   307.830  25090.215
Allocating 4GB
 Gb Rep Thr CLine  User      System   Wall  flt/cpu/s fault/wsec
  4  3    1   1    0.16s     11.13s  11.02s 69608.953  69609.929
  4  3    2   1    0.19s     11.91s   6.04s 64954.511 121523.482
  4  3    4   1    0.15s     18.81s   5.04s 41470.003 143927.689
  4  3    8   1    0.22s     44.42s   6.06s 17611.555 118104.495
  4  3   16   1    0.31s    135.85s   9.08s  5775.452  79887.187
  4  3   32   1    1.40s    420.54s  15.01s  1863.828  51742.635
  4  3   64   1    4.76s   1205.17s  21.06s   649.978  36375.098
  4  3  128   1   45.16s   2416.41s  23.04s   319.482  33511.208
Allocating 16GB
 Gb Rep Thr CLine  User      System   Wall  flt/cpu/s fault/wsec
 16  3    1   1    0.77s     64.02s  64.07s 48548.234  48555.210
 16  3    2   1    0.73s     70.05s  37.01s 44436.803  84583.807
 16  3    4   1    0.80s     79.43s  23.06s 39201.468 133261.169
 16  3    8   1    0.70s    207.43s  29.09s 15114.034 105204.451
 16  3   16   1    0.89s    510.13s  36.08s  6155.644  85273.863
 16  3   32   1    1.19s   1643.69s  58.03s  1912.434  53888.906
 16  3   64   1    7.10s   4644.63s  81.02s   676.248  38711.367
 16  3  128   1   67.00s   9873.81s  88.06s   316.445  35485.076
The fault handler does not scale well beyond 4 concurrent threads allocating memory.

Atomic PTE operations patch

Allocating 1GB
 Gb Rep Thr CLine  User      System   Wall  flt/cpu/s fault/wsec
  1  3    1   1    0.03s      1.99s   2.00s 96698.657  96582.505
  1  3    2   1    0.03s      2.64s   1.04s 73183.070 135507.159
  1  3    4   1    0.04s      2.79s   0.09s 69042.023 213365.094
  1  3    8   1    0.04s      3.08s   0.07s 62718.557 276536.378
  1  3   16   1    0.08s      4.83s   0.08s 39953.675 243596.604
  1  3   32   1    2.17s     11.01s   1.02s 14904.242 157164.893
  1  3   64   1   13.66s     30.31s   2.01s  4470.644  91363.023
  1  3  128   1   74.17s    114.09s   4.01s  1044.262  47285.524
Allocating 4GB
 Gb Rep Thr CLine  User      System   Wall  flt/cpu/s fault/wsec
  4  3    1   1    0.17s     11.05s  11.02s 69990.094  69987.700
  4  3    2   1    0.20s     11.47s   6.02s 67338.915 126201.925
  4  3    4   1    0.15s     10.84s   3.03s 71468.408 231644.971
  4  3    8   1    0.19s     11.70s   2.03s 66089.949 332308.887
  4  3   16   1    0.59s     19.36s   2.04s 39408.172 327444.716
  4  3   32   1    1.32s     44.51s   2.09s 17153.878 263079.172
  4  3   64   1    7.64s    111.83s   4.01s  6581.771 190825.697
  4  3  128   1   40.06s    438.94s   7.03s  1641.789 107286.045
Allocating 16GB
 Gb Rep Thr CLine  User      System   Wall  flt/cpu/s fault/wsec
 16  3    1   1    0.76s     63.62s  64.03s 48856.019  48861.435
 16  3    2   1    0.68s     68.13s  36.01s 45709.273  87014.775
 16  3    4   1    0.66s     56.34s  17.04s 55176.839 180100.535
 16  3    8   1    0.64s     52.07s  10.00s 59668.869 311712.600
 16  3   16   1    0.89s     74.79s   8.02s 41562.034 382884.245
 16  3   32   1    2.58s    172.80s   9.06s 17935.745 324358.819
 16  3   64   1    7.13s    436.28s  12.02s  7094.238 257560.257
 16  3  128   1   48.34s   1867.66s  22.06s  1641.819 138871.641
Performance increases up to 16 concurrent threads but then drops off, likely due to cacheline bouncing, which could be addressed by the split RSS patches.

Prefaulting patch

Allocating 1GB
 Gb Rep Thr CLine  User      System   Wall  flt/cpu/s fault/wsec
  1  3    1   1    0.01s      1.91s   1.09s101885.901 101825.209
  1  3    2   1    0.01s      2.08s   1.01s 93640.292 167843.120
  1  3    4   1    0.01s      2.17s   0.07s 89757.719 260366.344
  1  3    8   1    0.08s      2.50s   0.06s 75914.997 289684.994
  1  3   16   1    0.09s      3.06s   0.06s 62233.871 284580.856
  1  3   32   1    2.17s      7.97s   1.01s 19367.630 178274.468
  1  3   64   1    8.27s     27.37s   1.09s  5515.492  99687.208
  1  3  128   1   40.74s     64.23s   3.03s  1872.787  58228.428
Allocating 4GB
 Gb Rep Thr CLine  User      System   Wall  flt/cpu/s fault/wsec
  4  3    1   1    0.07s     10.75s  10.08s 72654.816  72656.012
  4  3    2   1    0.05s      8.59s   4.07s 90933.383 165518.465
  4  3    4   1    0.06s      8.38s   2.07s 93066.684 284662.744
  4  3    8   1    0.07s      9.03s   2.00s 86322.862 386625.024
  4  3   16   1    0.11s     11.24s   1.09s 69255.768 411677.546
  4  3   32   1    1.64s     31.14s   2.05s 23981.001 309606.010
  4  3   64   1   11.82s    106.65s   3.09s  6638.031 198125.131
  4  3  128   1   42.85s    260.29s   5.08s  2594.230 134793.726
Allocating 16GB
 Gb Rep Thr CLine  User      System   Wall  flt/cpu/s fault/wsec
 16  3    1   1    0.26s     62.51s  62.07s 50106.145  50108.917
 16  3    2   1    0.24s     58.35s  31.00s 53682.592 101220.626
 16  3    4   1    0.23s     48.93s  16.03s 63984.266 192642.348
 16  3    8   1    0.22s     38.13s   7.09s 82017.158 394527.495
 16  3   16   1    0.35s     42.59s   6.01s 73249.587 510574.838
 16  3   32   1    2.34s    135.53s   8.03s 22815.139 377974.474
 16  3   64   1    6.52s    412.73s  11.04s  7503.031 275621.659
 16  3  128   1   57.21s   1106.36s  15.09s  2703.487 196780.174
The benefits from prefaulting are even greater than those of the atomic operations, but it too only scales well up to 16 concurrent threads. However, this method may generate additional useless page table entries.

Prezeroing patch

scrub_load = 99. scrub_start = 4. scrub_stop = 2. Allocating 1 GB
 Gb Rep Thr CLine  User      System   Wall  flt/cpu/s fault/wsec
  1  3    1   1    0.05s      0.46s   0.05s380579.983 379122.948
  1  3    2   1    0.05s      1.43s   0.08s132626.202 228064.767
  1  3    4   1    0.05s      4.08s   1.02s 47527.510 153619.452
  1  3    8   1    0.05s     11.11s   1.07s 17601.547 113665.070
  1  3   16   1    0.35s     27.42s   2.02s  7078.742  87181.716
  1  3   32   1    1.21s     88.22s   3.05s  2198.151  55077.202
  1  3   64   1    7.21s    233.23s   5.01s   817.672  38532.779
  1  3  128   1   51.37s    562.34s   7.02s   320.355  27003.201
Allocating 4 GB
 Gb Rep Thr CLine  User      System   Wall  flt/cpu/s fault/wsec
  4  3    1   1    0.19s      3.14s   3.03s235607.454 235447.568
  4  3    2   1    0.16s      6.07s   3.05s125966.857 222342.763
  4  3    4   1    0.18s     14.77s   4.07s 52572.530 167226.098
  4  3    8   1    0.15s     42.85s   6.04s 18283.709 121987.954
  4  3   16   1    0.29s    112.35s   8.03s  6981.594  94505.654
  4  3   32   1    1.51s    363.61s  13.00s  2153.806  60210.738
  4  3   64   1    9.92s   1007.38s  18.03s   773.055  42786.801
  4  3  128   1   48.13s   2347.83s  22.05s   328.231  34884.233
The benefits here surface even in single thread performance measurements. The shorter execution time of the page fault handler results in some boost to the numbers at high CPU counts, but not much.

Altix 8 Way SMP system, 6G RAM

Allocating 1GB of memory with an increasing number of threads:
 Gb Rep Thr CLine  User      System   Wall  flt/cpu/s fault/wsec
  1  3    1   1    0.04s      2.22s   2.02s 86891.036  86893.388
  1  3    2   1    0.04s      3.36s   1.08s 57686.705 109111.707
  1  3    4   1    0.03s      4.17s   1.03s 46646.555 144286.301
  1  3    8   1    0.35s      5.32s   1.04s 34604.081 139312.992
Allocating 4GB.
 Gb Rep Thr CLine  User      System   Wall  flt/cpu/s fault/wsec
  4  3    1   1    0.19s     11.36s  11.05s 68015.711  68006.390
  4  3    2   1    0.17s     15.75s   8.05s 49356.828  92052.173
  4  3    4   1    0.17s     16.37s   5.02s 47505.073 148577.738
  4  3    8   1    0.18s     29.91s   5.01s 26126.788 151435.451

Atomic Page Table Operations patch

 Gb Rep Thr CLine  User      System   Wall  flt/cpu/s fault/wsec
  1  3    1   1    0.04s      2.18s   2.02s 87915.508  87858.256
  1  3    2   1    0.04s      3.21s   1.07s 60277.400 113595.465
  1  3    4   1    0.03s      3.31s   1.01s 58695.788 164231.902
  1  3    8   1    0.03s      3.54s   1.00s 55007.262 180576.786
 Gb Rep Thr CLine  User      System   Wall  flt/cpu/s fault/wsec
  4  3    1   1    0.17s     11.35s  11.05s 68205.814  68193.327
  4  3    2   1    0.17s     16.90s   8.09s 46064.868  88107.475
  4  3    4   1    0.17s     13.21s   4.02s 58742.872 183102.224
  4  3    8   1    0.16s     15.61s   3.04s 49836.375 227616.222
Page fault performance increases by around 50% with 8 CPUs.

Prezeroing patch

The numbers were obtained with scrub_load = 99, scrub_start = 4 and scrub_stop = 2. Samples vary depending on when the zeroed memory accumulated beforehand becomes exhausted and on the rate at which this zeroed memory can be replenished. If not all CPUs are in use then idle CPUs take over the zeroing. The benchmark only touches one cacheline per page, which increases the effect of prezeroing; on the other hand the benchmark continually allocates memory, draining the pool of zeroed pages. In practice an application will not continually allocate memory in this fashion, so the zeroed pages will not run out as fast as they do here, but a real application usually also touches more than one cacheline of a page.
 Gb Rep Thr CLine  User      System   Wall  flt/cpu/s fault/wsec
  1  1    1   1    0.01s      0.12s   0.01s459653.380 459109.663
  1  1    2   1    0.01s      0.26s   0.01s235470.809 383970.576
  1  1    4   1    0.01s      0.33s   0.02s189039.428 313930.323
  1  1    8   1    0.23s      0.53s   0.03s 85380.248 164070.448

 Gb Rep Thr CLine  User      System   Wall  flt/cpu/s fault/wsec
  1  1    1   1    0.01s      0.12s   0.01s466037.092 464177.888
  1  1    2   1    0.01s      0.31s   0.01s201528.933 340726.583
  1  1    4   1    0.02s      0.75s   0.02s 83781.414 219356.554
  1  1    8   1    0.00s      0.15s   0.01s404271.200 404814.488

 Gb Rep Thr CLine  User      System   Wall  flt/cpu/s fault/wsec
  1  3    1   1    0.04s      0.85s   0.08s218595.663 218569.383
  1  3    2   1    0.05s      0.94s   0.06s197379.160 304787.346
  1  3    4   1    0.04s      0.89s   0.07s209497.100 268058.928
  1  3    8   1    0.04s      1.74s   0.08s109894.459 241376.586

 Gb Rep Thr CLine  User      System   Wall  flt/cpu/s fault/wsec
  4  1    1   1    0.05s      0.62s   0.06s387353.215 385558.099
  4  1    2   1    0.06s      1.84s   1.01s136956.901 227575.591
  4  1    4   1    0.05s      3.20s   1.01s 80225.757 227674.193
  4  1    8   1    0.05s      5.28s   1.02s 49101.036 202901.977

 Gb Rep Thr CLine  User      System   Wall  flt/cpu/s fault/wsec
  4  1    1   1    0.05s      0.62s   0.06s384578.370 384956.712
  4  1    2   1    0.07s      2.23s   1.02s113503.379 204275.877
  4  1    4   1    0.04s      3.05s   1.03s 84493.351 200452.447
  4  1    8   1    0.06s      2.21s   1.00s114863.296 257988.540

 Gb Rep Thr CLine  User      System   Wall  flt/cpu/s fault/wsec
  4  3    1   1    0.20s      3.46s   3.06s214633.864 214652.933
  4  3    2   1    0.15s      6.14s   3.08s124718.309 204314.102
  4  3    4   1    0.16s      9.49s   3.07s 81409.832 211840.077
  4  3    8   1    0.20s     13.20s   4.00s 58657.298 193572.915
Page fault performance increases by up to a factor of 5. The improvements hold up to higher CPU counts if enough zeroed memory is available.

Prefaulting patch

/proc/sys/vm/max_prealloc_order = 4.

Allocating 1GB

 Gb Rep Thr CLine  User      System   Wall  flt/cpu/s fault/wsec
  1  3    1   1    0.01s      2.11s   2.01s 92224.730  92186.501
  1  3    2   1    0.01s      3.17s   1.06s 61586.576 116063.254
  1  3    4   1    0.01s      3.12s   1.02s 62582.064 161867.866
  1  3    8   1    0.01s      3.16s   1.01s 61946.646 178650.872
Allocating 4GB
Gb Rep Thr CLine  User      System   Wall  flt/cpu/s fault/wsec
  4  3    1   1    0.04s     11.05s  11.01s 70808.584  70803.038
  4  3    2   1    0.05s     16.60s   8.07s 47201.572  90122.438
  4  3    4   1    0.07s     11.74s   3.09s 66532.220 196734.678
  4  3    8   1    0.30s     13.59s   3.01s 56584.180 247731.744
Performance improvements are similar to the atomic page table operations patch.

i386 single processor, 1G RAM

The measurements with a single processor were taken to investigate what effect these patches have on a standard desktop machine.

Performance without any patches

200 Megabyte allocation. No patches. Single thread.

 Mb Rep Thr CLine  User      System   Wall  flt/cpu/s fault/wsec
200  3    1   1    0.00s      0.17s   0.01s844090.542 845814.978
200  3    1   1    0.00s      0.17s   0.01s853466.095 853376.002
200  3    1   1    0.00s      0.17s   0.01s853466.095 855138.626
200  3    1   1    0.00s      0.17s   0.01s848750.083 847934.815
200  3    1   1    0.00s      0.17s   0.01s848754.773 848248.554
200  3    1   1    0.00s      0.17s   0.01s844085.903 843914.312
200  3    1   1    0.01s      0.16s   0.01s848750.083 850573.694
200  3    1   1    0.00s      0.17s   0.01s844090.542 845451.843
600 Megabyte allocation. No patches. Single thread:
 Gb Rep Thr CLine  User      System   Wall  flt/cpu/s fault/wsec
600  3    1   1    0.01s      0.49s   0.05s851783.509 851389.104
600  3    1   1    0.01s      0.47s   0.04s886129.730 886490.815
600  3    1   1    0.02s      0.45s   0.04s908101.462 908138.994

Prezeroing

200 Megabyte allocation with prezeroing. Single thread. It was verified that enough prezeroed memory was available before the test to handle the complete allocation without additional zeroing.

 Mb Rep Thr CLine  User      System   Wall  flt/cpu/s fault/wsec
200  3    1   1    0.00s      0.02s   0.00s3756674.275 3712403.061
200  3    1   1    0.00s      0.03s   0.00s3488295.668 3501888.597
200  3    1   1    0.00s      0.03s   0.00s3407159.305 3420844.913
600 Megabyte allocation that requires additional zeroing while the test is running.
 Mb Rep Thr CLine  User      System   Wall  flt/cpu/s fault/wsec
600  3    1   1    0.01s      0.10s   0.03s3407167.058 1103645.811
600  3    1   1    0.01s      0.11s   0.04s3433763.088 956299.303
600  3    1   1    0.01s      0.10s   0.03s3516186.590 1101432.893
Performance improves fourfold if enough prezeroed memory is available. Allocation speed is still better even if scrubd needs to zero pages in the background. Note that the fault rates are much higher than on ia64. This is due to the smaller page size of 4k against the ia64 page size of 16k.

Atomic Page Table Operations

The kernel was compiled with SMP support in order to be able to use atomic operations.

600 Megabyte allocation:

 Mb Rep Thr CLine  User      System   Wall  flt/cpu/s fault/wsec
600  3    1   1    0.02s      0.47s   0.04s882570.432 882464.095
600  3    1   1    0.02s      0.47s   0.05s875540.423 875338.122
600  3    1   1    0.02s      0.47s   0.04s884348.279 884415.911
200 Megabyte allocation:
 Mb Rep Thr CLine  User      System   Wall  flt/cpu/s fault/wsec
200  3    1   1    0.00s      0.15s   0.01s904367.368 902906.876
200  3    1   1    0.00s      0.15s   0.01s904361.784 906107.149
200  3    1   1    0.00s      0.15s   0.01s904361.784 903307.763
There is no material change from the numbers listed above for 600MB allocations without prezeroing. The numbers for 200MB improve by 5-8%.

Prefaulting

200 MB allocation
Mb Rep Thr CLine  User      System   Wall  flt/cpu/s fault/wsec
200  3    1   1    0.00s      0.17s   0.01s848750.083 847902.050
200  3    1   1    0.00s      0.17s   0.01s853466.095 853565.693
200  3    1   1    0.00s      0.17s   0.01s848750.083 849595.115
200  3    1   1    0.00s      0.17s   0.01s848754.773 848628.162
200  3    1   1    0.00s      0.17s   0.01s844090.542 844554.657
200  3    1   1    0.00s      0.17s   0.01s848754.773 848590.654
200  3    1   1    0.00s      0.17s   0.01s848750.083 848792.295
200  3    1   1    0.00s      0.17s   0.01s844090.542 846406.904
600 MB allocation
 Mb Rep Thr CLine  User      System   Wall  flt/cpu/s fault/wsec
600  3    1   1    0.00s      0.16s   0.01s856766.858 854672.330
600  3    1   1    0.00s      0.16s   0.01s851784.873 851057.402
600  3    1   1    0.00s      0.16s   0.01s856766.858 853173.979
No impact in single user mode.

LMBench

Altix 8 way SMP, 6G RAM

Testing done with LMBench 3.0-a4 on an Altix 8p 1.3GHz Itanium.

LMBench output may be found here. The test shows no significant performance regressions. Some system operations are accelerated by page prezeroing.

Prezeroing, prefaulting and atomic page table operations on i386, single processor

Testing was done with LMBench 3.0-a4 on an AMD64 3200+ with 1 GB RAM.

The LMBench output is available here.

The data basically shows that performance stays similar across all tests. There seems to be a minor slowdown in some tests with prezeroing and some minor improvements with atomic page table operations.

Bibliography

Implementation of Multiple Pagesize Support in HP-UX

By Indira Subramanian, Cliff Mather, Kurt Peterson, and Balakrishna Raghunath.

USENIX Presentation, 1998

Description of the implementation of multiple page support in HP-UX.

Itanium Page Tables and TLB

By Matthew Chapman, Ian Wienand, Gernot Heiser.

Paper, May 2003

Suggestions on how to use the features of the Itanium MMU for Linux.

Transparent operating system support for superpages

PhD dissertation of Juan E. Navarro.

Navarro develops methodologies to elevate single pages to superpages in order to reduce TLB use.

Zoran Radovic and Erik Hagersten on NUMA locks

RH and HBO Locks Home Page.
