To allow access to pages of memory that have never been accessed before, the page fault handler must reserve a free page of memory and then make that page visible to the process by adding an entry to the process's page table. This step is critical: page fault performance determines how quickly a process can acquire memory, and it is of particular importance for applications that use large amounts of memory. As memory sizes grow, Linux must generate and handle more and more page faults, so the scalability of the page fault handler becomes an increasingly important issue.
The Linux page fault handler relies on a read-write semaphore (mmap_sem) and a spin lock (the page_table_lock) to synchronize multiple threads of a task. A page fault first acquires a read lock on mmap_sem (which alone would allow other threads to continue processing page faults) and then acquires the page_table_lock (which serializes access to the page table and important data structures) before acquiring a free page from the page allocator. The page is then cleared by overwriting its contents with zeros (only initialized memory is provided to processes!) and is assigned to the process by creating a corresponding page table entry in the page table of the process. The page fault handler is a very hot code path: it is sensitive to minor code changes and depends heavily on the organization of data structures. Cache line bouncing has a critical influence on page fault performance in SMP systems. It becomes particularly significant for large applications (such as huge databases or computational applications) that try to minimize startup time by running multiple threads of a process on different processors in order to initialize their memory structures concurrently.
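The locking order described above can be modeled in user space. The following is only a sketch under stated assumptions: `struct mm_model`, `mm_model_init` and `handle_fault_locked` are hypothetical names, a pthread rwlock and mutex stand in for mmap_sem and the page_table_lock, and `malloc` stands in for the page allocator. It is not the kernel code.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096   /* assumed page size for this model */
#define MODEL_NPTES     512    /* assumed size of the toy page table */

/* Hypothetical user-space stand-in for the per-process memory descriptor. */
struct mm_model {
    pthread_rwlock_t mmap_sem;         /* models mmap_sem */
    pthread_mutex_t  page_table_lock;  /* models the page_table_lock spin lock */
    void *pte[MODEL_NPTES];            /* NULL means "no page mapped yet" */
};

int mm_model_init(struct mm_model *mm)
{
    for (size_t i = 0; i < MODEL_NPTES; i++)
        mm->pte[i] = NULL;
    if (pthread_rwlock_init(&mm->mmap_sem, NULL) != 0)
        return -1;
    if (pthread_mutex_init(&mm->page_table_lock, NULL) != 0)
        return -1;
    return 0;
}

/* Handle a first-touch fault on slot idx in the order the text describes:
 * read lock, then page table lock, then allocate, zero and map.
 * Returns 0 on success, -1 on a bad index or allocation failure. */
int handle_fault_locked(struct mm_model *mm, size_t idx)
{
    int ret = 0;

    if (idx >= MODEL_NPTES)
        return -1;

    pthread_rwlock_rdlock(&mm->mmap_sem);        /* other faulting threads may also hold this */
    pthread_mutex_lock(&mm->page_table_lock);    /* but only one thread gets past this point */

    if (mm->pte[idx] == NULL) {
        void *page = malloc(MODEL_PAGE_SIZE);    /* stand-in for the page allocator */
        if (page != NULL) {
            memset(page, 0, MODEL_PAGE_SIZE);    /* only zeroed memory reaches the process */
            mm->pte[idx] = page;                 /* "create the page table entry" */
        } else {
            ret = -1;
        }
    }

    pthread_mutex_unlock(&mm->page_table_lock);
    pthread_rwlock_unlock(&mm->mmap_sem);
    return ret;
}
```

In this model every faulting thread serializes on `page_table_lock`, even when the threads touch different slots, which is the contention the rest of the text sets out to remove.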
Performance measurements show that current (2.6.10) code in the Linux page fault handler only scales well up to 2 or 4 CPUs. Performance is negatively impacted by larger CPU counts and becomes worse than a single thread for 16 CPUs. Performance may drop to a fraction of single thread performance for SMP systems with more than 64 CPUs.
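A measurement of this kind can be approximated from user space by mapping anonymous memory and writing to each page once, so that every page takes its first-touch fault. The sketch below is an assumption about how such a test could look, not the test program used for the numbers in this document; `touch_pages` is a hypothetical helper, and the actual number of faults taken may differ from the number of pages touched (for example when the kernel maps larger pages).

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/* Map `bytes` of anonymous memory and write one byte per page.
 * Returns the number of pages touched (0 on failure) and stores the
 * elapsed wall-clock time for the touching loop in *secs. */
size_t touch_pages(size_t bytes, double *secs)
{
    long psz = sysconf(_SC_PAGESIZE);
    struct timespec t0, t1;
    size_t pages = 0;
    char *mem;

    *secs = 0.0;
    if (psz <= 0 || bytes == 0)
        return 0;

    mem = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t off = 0; off < bytes; off += (size_t)psz) {
        mem[off] = 1;          /* first write to the page triggers the fault */
        pages++;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    munmap(mem, bytes);
    *secs = (double)(t1.tv_sec - t0.tv_sec) +
            (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    return pages;
}
```

Dividing the returned page count by the elapsed time gives a rough faults-per-wall-second figure for one thread; running the same loop from several threads of one process on disjoint regions would exercise the contention discussed here.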
Three means of optimizing page fault performance are covered here. First, one may avoid the use of the page table lock through atomic operations on page table entries. Multiple faults may then be handled concurrently on multiple CPUs in an SMP system. One must also take additional measures to avoid cache line bouncing. Second, the page fault handler may analyze the access pattern of the process. Optimizing for sequential memory allocation is then possible by anticipating future accesses: multiple pages can be preallocated and multiple page table entries may be generated in a single page fault. The locking overhead is reduced since the fault handler is invoked less frequently, and therefore SMP performance improves. Finally, if zeroed pages are available then the page fault handler may simply assign a zeroed page to the process, avoiding the clearing of pages in the fault handler. This will reduce the number of times the page table lock has to be acquired.
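The third idea, handing out pages that were zeroed ahead of time, can be illustrated with a small pool. This is a user-space sketch only: `struct zero_pool`, `pool_refill_one` and `get_zeroed_page_model` are hypothetical names, and `malloc` stands in for the page allocator.

```c
#include <stdlib.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096   /* assumed page size for this model */
#define POOL_CAPACITY   64     /* assumed pool size */

/* Hypothetical pool of pages that were zeroed outside the fault path. */
struct zero_pool {
    void  *pages[POOL_CAPACITY];
    size_t count;
};

/* Background work: zero one page ahead of time and stash it.
 * Returns 1 if a page was added, 0 if the pool is full or allocation failed. */
int pool_refill_one(struct zero_pool *pool)
{
    void *page;

    if (pool->count >= POOL_CAPACITY)
        return 0;
    page = malloc(MODEL_PAGE_SIZE);
    if (page == NULL)
        return 0;
    memset(page, 0, MODEL_PAGE_SIZE);
    pool->pages[pool->count++] = page;
    return 1;
}

/* Fault path: prefer a prezeroed page; fall back to zeroing on the spot.
 * *zeroed_in_fault is set to 1 when the clearing had to happen here. */
void *get_zeroed_page_model(struct zero_pool *pool, int *zeroed_in_fault)
{
    void *page;

    if (pool->count > 0) {
        *zeroed_in_fault = 0;
        return pool->pages[--pool->count];
    }
    page = malloc(MODEL_PAGE_SIZE);
    if (page != NULL)
        memset(page, 0, MODEL_PAGE_SIZE);
    *zeroed_in_fault = 1;
    return page;
}
```

The fast path does no clearing at all, so the benefit depends entirely on the pool being refilled faster than faults drain it.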
Atomic operations on the page table are not possible on all platforms: on some platforms the page table entries are larger than the word size of the platform, which makes it impossible to update the entries atomically. A fallback mechanism was designed that uses the page table lock instead of atomic operations on those platforms.
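The fallback idea can be sketched as a compare-and-exchange that is emulated under a lock when the entry is wider than a machine word. Everything here is an illustrative assumption: `struct wide_pte`, `page_table_lock_model` and `pte_cmpxchg_fallback` are hypothetical names, and a pthread mutex stands in for the page table lock.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical page table entry made of two 32-bit halves, as on a
 * platform whose entries are wider than its word size. */
struct wide_pte {
    uint32_t lo;
    uint32_t hi;
};

/* Stand-in for the page table lock. */
static pthread_mutex_t page_table_lock_model = PTHREAD_MUTEX_INITIALIZER;

/* Emulated compare-and-exchange: replace *pte with newval only if it still
 * equals oldval. Returns 1 if the exchange happened, 0 otherwise. */
int pte_cmpxchg_fallback(struct wide_pte *pte,
                         struct wide_pte oldval,
                         struct wide_pte newval)
{
    int swapped = 0;

    pthread_mutex_lock(&page_table_lock_model);
    if (pte->lo == oldval.lo && pte->hi == oldval.hi) {
        *pte = newval;      /* both halves change while the lock is held */
        swapped = 1;
    }
    pthread_mutex_unlock(&page_table_lock_model);
    return swapped;
}
```

Callers see the same interface in both cases, so the fault path can be written once against the compare-and-exchange operation and still work where the hardware cannot update an entry atomically.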
In more detail: the performance increase is accomplished by avoiding the use of the page_table_lock spinlock (but not mm->mmap_sem) through new atomic operations on ptes (ptep_xchg, ptep_cmpxchg) and on pmds, puds and pgds (pgd_test_and_populate, pud_test_and_populate, pmd_test_and_populate).
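How a compare-and-exchange can replace the lock when installing an entry is sketched below with C11 atomics. This is a user-space model, not the kernel's ptep_cmpxchg: `pte_model_t`, `pte_install_cmpxchg` and `handle_fault_atomic` are hypothetical names, and `calloc` stands in for allocating and zeroing a page.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

#define MODEL_PAGE_SIZE 4096   /* assumed page size for this model */

/* Hypothetical word-sized page table entry; 0 means "not populated". */
typedef _Atomic(uintptr_t) pte_model_t;

/* Install page only if the entry is still empty.
 * Returns 1 if this caller installed it, 0 if another thread got there first. */
int pte_install_cmpxchg(pte_model_t *pte, void *page)
{
    uintptr_t expected = 0;

    return atomic_compare_exchange_strong(pte, &expected, (uintptr_t)page);
}

/* Lock-free first-touch fault: prepare a zeroed page outside any lock and
 * publish it with a single compare-and-exchange.
 * Returns 0 on success (including losing the race), -1 if out of memory. */
int handle_fault_atomic(pte_model_t *pte)
{
    void *page;

    if (atomic_load(pte) != 0)
        return 0;                      /* already mapped, nothing to do */

    page = calloc(1, MODEL_PAGE_SIZE); /* allocate and zero */
    if (page == NULL)
        return -1;

    if (!pte_install_cmpxchg(pte, page))
        free(page);                    /* lost the race: discard our page */
    return 0;
}
```

Threads faulting on different entries never touch a shared lock in this model, and two threads racing on the same entry are resolved by the compare-and-exchange, with the loser giving its page back.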
The page table lock can be avoided in the following situations:
The preallocation may be controlled by setting /proc/sys/vm/max_prealloc_order. By default the order is set to 1. This means that only one page may be preallocated. For applications that allocate and use large amounts of memory, a setting of 4 gives the most benefit. However, a larger setting may slightly decrease performance for other applications, since unnecessary pages may be allocated. The performance decrease for kernel compilation is around 5% if max_prealloc_order is set to 4.
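One way to anticipate sequential access while bounding the waste is to grow the number of pages mapped per fault only as long as faults keep arriving where the previous batch ended. The sketch below is an assumed heuristic for illustration, not the logic of the actual patch; `struct prefault_state` and `pages_to_map` are hypothetical, and `max_pages` stands for whatever limit the max_prealloc_order setting implies.

```c
/* Hypothetical per-mapping state for detecting sequential faults. */
struct prefault_state {
    unsigned long next_expected;  /* page number a sequential fault would hit */
    unsigned int  run;            /* pages mapped by the previous fault, 0 = none yet */
};

/* Decide how many pages to map for a fault on page number `page`.
 * A fault that lands exactly where the last batch ended doubles the batch,
 * capped at max_pages; any other fault resets the batch to one page. */
unsigned int pages_to_map(struct prefault_state *st,
                          unsigned long page,
                          unsigned int max_pages)
{
    unsigned int n = 1;

    if (max_pages < 1)
        max_pages = 1;

    if (st->run > 0 && page == st->next_expected) {
        n = st->run * 2;
        if (n > max_pages)
            n = max_pages;
    }

    st->run = n;
    st->next_expected = page + n;
    return n;
}
```

With a cap of 16 pages, a strictly sequential walk maps 1, 2, 4, 8 and then 16 pages per fault, while a random access pattern stays at one page per fault and creates no unneeded entries.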
Kernel compilation is a nonthreaded load using some anonymous memory. Page fault scalability issues are mainly encountered with applications accessing large amounts of memory concurrently from multiple threads, so no significant performance benefit is to be expected here. The tests listed below are mainly intended to show that there is no negative impact.
real 2m53.692s user 18m55.574s sys 0m51.603s
real 2m52.757s user 18m54.735s sys 0m50.832s
scrub_load = 99. Prezeroing during kernel compilation.
real 2m57.225s user 19m1.399s sys 0m42.298s
real 3m9.471s user 18m48.175s sys 0m55.082s
max_prealloc_order = 1.
real 2m53.537s user 18m55.664s sys 0m50.976s
real 2m53.104s user 18m55.580s sys 0m51.112s
Performance with prezeroing switched off (scrub_start == 99) on Linux 2.6.11-rc4-bk2. Both samples are the best times after repeated compilation runs:
Mon Feb 14 11:03:07 PST 2005
Node 0, zone DMA 2 4 4 4 4 0 0 0 0 0 0
Zeroed Pages 0 0 0 0 0 1 1 1 1 1 2
Node 0, zone Normal 2152 1826 691 793 1202 274 119 86 86 58 77
Zeroed Pages 0 0 0 0 0 0 0 0 0 0 0
278.34user 26.58system 5:25.18elapsed 93%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+8053981minor)pagefaults 0swaps
Performance with prezeroing (scrubd_start == 8, scrub_stop == 2):
Mon Feb 14 11:46:07 PST 2005
Node 0, zone DMA 2 4 0 0 0 0 0 0 0 0 0
Zeroed Pages 0 0 4 4 4 1 1 1 1 1 2
Node 0, zone Normal 2910 2187 1763 1250 0 0 0 0 0 0 0
Zeroed Pages 0 0 1 167 814 217 118 92 82 71 68
278.58user 24.26system 5:24.01elapsed 93%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+8053981minor)pagefaults 0swaps
System time for compilation is reduced somewhat, which leads to a 1 second win (~0.3%), but that win is within the margin of potential noise. No benefit should be expected for kernel building from prezeroing.
 Gb Rep Thr CLine    User    System    Wall    flt/cpu/s   fault/wsec
  1   3   1     1   0.04s     1.99s   2.00s    96466.979    96485.259
  1   3   2     1   0.04s     2.81s   1.05s    68759.082   126692.387
  1   3   4     1   0.05s     4.65s   1.04s    41803.677   138548.237
  1   3   8     1   0.12s    12.44s   1.09s    15649.165    99233.376
  1   3  16     1   0.11s    31.04s   2.05s     6308.209    77879.592
  1   3  32     1   0.82s    98.80s   4.00s     1973.422    48432.408
  1   3  64     1   9.29s   293.51s   6.02s      649.266    31229.490
  1   3 128     1  50.20s   588.48s   7.08s      307.830    25090.215
Allocating 4GB
 Gb Rep Thr CLine    User    System    Wall    flt/cpu/s   fault/wsec
  4   3   1     1   0.16s    11.13s  11.02s    69608.953    69609.929
  4   3   2     1   0.19s    11.91s   6.04s    64954.511   121523.482
  4   3   4     1   0.15s    18.81s   5.04s    41470.003   143927.689
  4   3   8     1   0.22s    44.42s   6.06s    17611.555   118104.495
  4   3  16     1   0.31s   135.85s   9.08s     5775.452    79887.187
  4   3  32     1   1.40s   420.54s  15.01s     1863.828    51742.635
  4   3  64     1   4.76s  1205.17s  21.06s      649.978    36375.098
  4   3 128     1  45.16s  2416.41s  23.04s      319.482    33511.208
Allocating 16GB
 Gb Rep Thr CLine    User    System    Wall    flt/cpu/s   fault/wsec
 16   3   1     1   0.77s    64.02s  64.07s    48548.234    48555.210
 16   3   2     1   0.73s    70.05s  37.01s    44436.803    84583.807
 16   3   4     1   0.80s    79.43s  23.06s    39201.468   133261.169
 16   3   8     1   0.70s   207.43s  29.09s    15114.034   105204.451
 16   3  16     1   0.89s   510.13s  36.08s     6155.644    85273.863
 16   3  32     1   1.19s  1643.69s  58.03s     1912.434    53888.906
 16   3  64     1   7.10s  4644.63s  81.02s      676.248    38711.367
 16   3 128     1  67.00s  9873.81s  88.06s      316.445    35485.076
The fault handler does not scale well over 4 concurrent threads allocating memory.
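The scaling behaviour can be read off the fault/wsec column directly. As a small worked example, the sketch below takes the 1 GB fault/wsec values from the first of these three tables and finds where throughput peaks and where it first falls below the single-thread rate; the helper names are hypothetical and only the numbers come from the measurements.

```c
#include <stddef.h>

/* Thread counts and fault/wsec values copied from the 1 GB table above. */
static const int threads_1gb[] = { 1, 2, 4, 8, 16, 32, 64, 128 };
static const double faults_per_wsec_1gb[] = {
    96485.259, 126692.387, 138548.237, 99233.376,
    77879.592, 48432.408, 31229.490, 25090.215
};

/* Index of the highest rate. */
size_t peak_index(const double *rate, size_t n)
{
    size_t best = 0;

    for (size_t i = 1; i < n; i++)
        if (rate[i] > rate[best])
            best = i;
    return best;
}

/* Index of the first entry slower than the single-thread entry rate[0],
 * or n if no entry is slower. */
size_t first_below_single(const double *rate, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (rate[i] < rate[0])
            return i;
    return n;
}

/* Throughput at index i relative to the single-thread rate. */
double relative_throughput(const double *rate, size_t i)
{
    return rate[i] / rate[0];
}
```

For this data the peak is at 4 threads, 16 threads is the first count that is slower than a single thread, and 128 threads deliver roughly a quarter of the single-thread rate.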
 Gb Rep Thr CLine    User    System    Wall    flt/cpu/s   fault/wsec
  1   3   1     1   0.03s     1.99s   2.00s    96698.657    96582.505
  1   3   2     1   0.03s     2.64s   1.04s    73183.070   135507.159
  1   3   4     1   0.04s     2.79s   0.09s    69042.023   213365.094
  1   3   8     1   0.04s     3.08s   0.07s    62718.557   276536.378
  1   3  16     1   0.08s     4.83s   0.08s    39953.675   243596.604
  1   3  32     1   2.17s    11.01s   1.02s    14904.242   157164.893
  1   3  64     1  13.66s    30.31s   2.01s     4470.644    91363.023
  1   3 128     1  74.17s   114.09s   4.01s     1044.262    47285.524
Allocating 4GB
 Gb Rep Thr CLine    User    System    Wall    flt/cpu/s   fault/wsec
  4   3   1     1   0.17s    11.05s  11.02s    69990.094    69987.700
  4   3   2     1   0.20s    11.47s   6.02s    67338.915   126201.925
  4   3   4     1   0.15s    10.84s   3.03s    71468.408   231644.971
  4   3   8     1   0.19s    11.70s   2.03s    66089.949   332308.887
  4   3  16     1   0.59s    19.36s   2.04s    39408.172   327444.716
  4   3  32     1   1.32s    44.51s   2.09s    17153.878   263079.172
  4   3  64     1   7.64s   111.83s   4.01s     6581.771   190825.697
  4   3 128     1  40.06s   438.94s   7.03s     1641.789   107286.045
Allocating 16GB
 Gb Rep Thr CLine    User    System    Wall    flt/cpu/s   fault/wsec
 16   3   1     1   0.76s    63.62s  64.03s    48856.019    48861.435
 16   3   2     1   0.68s    68.13s  36.01s    45709.273    87014.775
 16   3   4     1   0.66s    56.34s  17.04s    55176.839   180100.535
 16   3   8     1   0.64s    52.07s  10.00s    59668.869   311712.600
 16   3  16     1   0.89s    74.79s   8.02s    41562.034   382884.245
 16   3  32     1   2.58s   172.80s   9.06s    17935.745   324358.819
 16   3  64     1   7.13s   436.28s  12.02s     7094.238   257560.257
 16   3 128     1  48.34s  1867.66s  22.06s     1641.819   138871.641
Performance increases up to 16 concurrent threads but then drops off likely due to cacheline bouncing which could be addressed by the split rss patches.
 Gb Rep Thr CLine    User    System    Wall    flt/cpu/s   fault/wsec
  1   3   1     1   0.01s     1.91s   1.09s   101885.901   101825.209
  1   3   2     1   0.01s     2.08s   1.01s    93640.292   167843.120
  1   3   4     1   0.01s     2.17s   0.07s    89757.719   260366.344
  1   3   8     1   0.08s     2.50s   0.06s    75914.997   289684.994
  1   3  16     1   0.09s     3.06s   0.06s    62233.871   284580.856
  1   3  32     1   2.17s     7.97s   1.01s    19367.630   178274.468
  1   3  64     1   8.27s    27.37s   1.09s     5515.492    99687.208
  1   3 128     1  40.74s    64.23s   3.03s     1872.787    58228.428
Allocating 4GB
 Gb Rep Thr CLine    User    System    Wall    flt/cpu/s   fault/wsec
  4   3   1     1   0.07s    10.75s  10.08s    72654.816    72656.012
  4   3   2     1   0.05s     8.59s   4.07s    90933.383   165518.465
  4   3   4     1   0.06s     8.38s   2.07s    93066.684   284662.744
  4   3   8     1   0.07s     9.03s   2.00s    86322.862   386625.024
  4   3  16     1   0.11s    11.24s   1.09s    69255.768   411677.546
  4   3  32     1   1.64s    31.14s   2.05s    23981.001   309606.010
  4   3  64     1  11.82s   106.65s   3.09s     6638.031   198125.131
  4   3 128     1  42.85s   260.29s   5.08s     2594.230   134793.726
Allocating 16GB
 Gb Rep Thr CLine    User    System    Wall    flt/cpu/s   fault/wsec
 16   3   1     1   0.26s    62.51s  62.07s    50106.145    50108.917
 16   3   2     1   0.24s    58.35s  31.00s    53682.592   101220.626
 16   3   4     1   0.23s    48.93s  16.03s    63984.266   192642.348
 16   3   8     1   0.22s    38.13s   7.09s    82017.158   394527.495
 16   3  16     1   0.35s    42.59s   6.01s    73249.587   510574.838
 16   3  32     1   2.34s   135.53s   8.03s    22815.139   377974.474
 16   3  64     1   6.52s   412.73s  11.04s     7503.031   275621.659
 16   3 128     1  57.21s  1106.36s  15.09s     2703.487   196780.174
The benefits from prefaulting are even greater than the atomic operations but it also only scales well up to 16 concurrent threads. However, this method may generate additional useless page table entries.
 Gb Rep Thr CLine    User    System    Wall    flt/cpu/s   fault/wsec
  1   3   1     1   0.05s     0.46s   0.05s   380579.983   379122.948
  1   3   2     1   0.05s     1.43s   0.08s   132626.202   228064.767
  1   3   4     1   0.05s     4.08s   1.02s    47527.510   153619.452
  1   3   8     1   0.05s    11.11s   1.07s    17601.547   113665.070
  1   3  16     1   0.35s    27.42s   2.02s     7078.742    87181.716
  1   3  32     1   1.21s    88.22s   3.05s     2198.151    55077.202
  1   3  64     1   7.21s   233.23s   5.01s      817.672    38532.779
  1   3 128     1  51.37s   562.34s   7.02s      320.355    27003.201
Allocating 4 GB
 Gb Rep Thr CLine    User    System    Wall    flt/cpu/s   fault/wsec
  4   3   1     1   0.19s     3.14s   3.03s   235607.454   235447.568
  4   3   2     1   0.16s     6.07s   3.05s   125966.857   222342.763
  4   3   4     1   0.18s    14.77s   4.07s    52572.530   167226.098
  4   3   8     1   0.15s    42.85s   6.04s    18283.709   121987.954
  4   3  16     1   0.29s   112.35s   8.03s     6981.594    94505.654
  4   3  32     1   1.51s   363.61s  13.00s     2153.806    60210.738
  4   3  64     1   9.92s  1007.38s  18.03s      773.055    42786.801
  4   3 128     1  48.13s  2347.83s  22.05s      328.231    34884.233
The benefits here surface even for single thread performance measurements. The shorter execution time of the page fault handler results in some boost to the numbers at a high cpu count but not much.
 Gb Rep Thr CLine    User    System    Wall    flt/cpu/s   fault/wsec
  1   3   1     1   0.04s     2.22s   2.02s    86891.036    86893.388
  1   3   2     1   0.04s     3.36s   1.08s    57686.705   109111.707
  1   3   4     1   0.03s     4.17s   1.03s    46646.555   144286.301
  1   3   8     1   0.35s     5.32s   1.04s    34604.081   139312.992
Allocating 4GB.
 Gb Rep Thr CLine    User    System    Wall    flt/cpu/s   fault/wsec
  4   3   1     1   0.19s    11.36s  11.05s    68015.711    68006.390
  4   3   2     1   0.17s    15.75s   8.05s    49356.828    92052.173
  4   3   4     1   0.17s    16.37s   5.02s    47505.073   148577.738
  4   3   8     1   0.18s    29.91s   5.01s    26126.788   151435.451
 Gb Rep Thr CLine    User    System    Wall    flt/cpu/s   fault/wsec
  1   3   1     1   0.04s     2.18s   2.02s    87915.508    87858.256
  1   3   2     1   0.04s     3.21s   1.07s    60277.400   113595.465
  1   3   4     1   0.03s     3.31s   1.01s    58695.788   164231.902
  1   3   8     1   0.03s     3.54s   1.00s    55007.262   180576.786
 Gb Rep Thr CLine    User    System    Wall    flt/cpu/s   fault/wsec
  4   3   1     1   0.17s    11.35s  11.05s    68205.814    68193.327
  4   3   2     1   0.17s    16.90s   8.09s    46064.868    88107.475
  4   3   4     1   0.17s    13.21s   4.02s    58742.872   183102.224
  4   3   8     1   0.16s    15.61s   3.04s    49836.375   227616.222
Page fault performance increases around 50% at 8 cpus.
 Gb Rep Thr CLine    User    System    Wall    flt/cpu/s   fault/wsec
  1   1   1     1   0.01s     0.12s   0.01s   459653.380   459109.663
  1   1   2     1   0.01s     0.26s   0.01s   235470.809   383970.576
  1   1   4     1   0.01s     0.33s   0.02s   189039.428   313930.323
  1   1   8     1   0.23s     0.53s   0.03s    85380.248   164070.448
 Gb Rep Thr CLine    User    System    Wall    flt/cpu/s   fault/wsec
  1   1   1     1   0.01s     0.12s   0.01s   466037.092   464177.888
  1   1   2     1   0.01s     0.31s   0.01s   201528.933   340726.583
  1   1   4     1   0.02s     0.75s   0.02s    83781.414   219356.554
  1   1   8     1   0.00s     0.15s   0.01s   404271.200   404814.488
 Gb Rep Thr CLine    User    System    Wall    flt/cpu/s   fault/wsec
  1   3   1     1   0.04s     0.85s   0.08s   218595.663   218569.383
  1   3   2     1   0.05s     0.94s   0.06s   197379.160   304787.346
  1   3   4     1   0.04s     0.89s   0.07s   209497.100   268058.928
  1   3   8     1   0.04s     1.74s   0.08s   109894.459   241376.586
 Gb Rep Thr CLine    User    System    Wall    flt/cpu/s   fault/wsec
  4   1   1     1   0.05s     0.62s   0.06s   387353.215   385558.099
  4   1   2     1   0.06s     1.84s   1.01s   136956.901   227575.591
  4   1   4     1   0.05s     3.20s   1.01s    80225.757   227674.193
  4   1   8     1   0.05s     5.28s   1.02s    49101.036   202901.977
 Gb Rep Thr CLine    User    System    Wall    flt/cpu/s   fault/wsec
  4   1   1     1   0.05s     0.62s   0.06s   384578.370   384956.712
  4   1   2     1   0.07s     2.23s   1.02s   113503.379   204275.877
  4   1   4     1   0.04s     3.05s   1.03s    84493.351   200452.447
  4   1   8     1   0.06s     2.21s   1.00s   114863.296   257988.540
 Gb Rep Thr CLine    User    System    Wall    flt/cpu/s   fault/wsec
  4   3   1     1   0.20s     3.46s   3.06s   214633.864   214652.933
  4   3   2     1   0.15s     6.14s   3.08s   124718.309   204314.102
  4   3   4     1   0.16s     9.49s   3.07s    81409.832   211840.077
  4   3   8     1   0.20s    13.20s   4.00s    58657.298   193572.915
Page fault performance increases up to a factor of 5. The improvements hold up to a higher cpu count if enough zeroed memory is available.
Allocating 1GB
 Gb Rep Thr CLine    User    System    Wall    flt/cpu/s   fault/wsec
  1   3   1     1   0.01s     2.11s   2.01s    92224.730    92186.501
  1   3   2     1   0.01s     3.17s   1.06s    61586.576   116063.254
  1   3   4     1   0.01s     3.12s   1.02s    62582.064   161867.866
  1   3   8     1   0.01s     3.16s   1.01s    61946.646   178650.872
Allocating 4GB
 Gb Rep Thr CLine    User    System    Wall    flt/cpu/s   fault/wsec
  4   3   1     1   0.04s    11.05s  11.01s    70808.584    70803.038
  4   3   2     1   0.05s    16.60s   8.07s    47201.572    90122.438
  4   3   4     1   0.07s    11.74s   3.09s    66532.220   196734.678
  4   3   8     1   0.30s    13.59s   3.01s    56584.180   247731.744
Performance improvements are similar to the atomic page table operations patch.
200 Megabyte allocation. No patches. Single thread.
 Mb Rep Thr CLine    User    System    Wall    flt/cpu/s   fault/wsec
200   3   1     1   0.00s     0.17s   0.01s   844090.542   845814.978
200   3   1     1   0.00s     0.17s   0.01s   853466.095   853376.002
200   3   1     1   0.00s     0.17s   0.01s   853466.095   855138.626
200   3   1     1   0.00s     0.17s   0.01s   848750.083   847934.815
200   3   1     1   0.00s     0.17s   0.01s   848754.773   848248.554
200   3   1     1   0.00s     0.17s   0.01s   844085.903   843914.312
200   3   1     1   0.01s     0.16s   0.01s   848750.083   850573.694
200   3   1     1   0.00s     0.17s   0.01s   844090.542   845451.843
600 Megabyte allocation. No patches. Single thread:
 Mb Rep Thr CLine    User    System    Wall    flt/cpu/s   fault/wsec
600   3   1     1   0.01s     0.49s   0.05s   851783.509   851389.104
600   3   1     1   0.01s     0.47s   0.04s   886129.730   886490.815
600   3   1     1   0.02s     0.45s   0.04s   908101.462   908138.994
200 Megabyte allocation with prezeroing. Single thread. Before the test it was ensured that enough prezeroed memory was available to handle the complete allocation without additional zeroing.
 Mb Rep Thr CLine    User    System    Wall    flt/cpu/s   fault/wsec
200   3   1     1   0.00s     0.02s   0.00s  3756674.275  3712403.061
200   3   1     1   0.00s     0.03s   0.00s  3488295.668  3501888.597
200   3   1     1   0.00s     0.03s   0.00s  3407159.305  3420844.913
600 Megabyte allocation that requires rezeroing while the test is running.
 Mb Rep Thr CLine    User    System    Wall    flt/cpu/s   fault/wsec
600   3   1     1   0.01s     0.10s   0.03s  3407167.058  1103645.811
600   3   1     1   0.01s     0.11s   0.04s  3433763.088   956299.303
600   3   1     1   0.01s     0.10s   0.03s  3516186.590  1101432.893
Performance improves fourfold if enough prezeroed memory is available. Allocation speed is still better even if scrubd needs to zero pages in the background. Note that the fault rates are much higher than on ia64. This is due to the smaller page size of 4k against the ia64 page size of 16k.
600 Megabyte allocation:
 Mb Rep Thr CLine    User    System    Wall    flt/cpu/s   fault/wsec
600   3   1     1   0.02s     0.47s   0.04s   882570.432   882464.095
600   3   1     1   0.02s     0.47s   0.05s   875540.423   875338.122
600   3   1     1   0.02s     0.47s   0.04s   884348.279   884415.911
200 Megabyte allocation:
 Mb Rep Thr CLine    User    System    Wall    flt/cpu/s   fault/wsec
200   3   1     1   0.00s     0.15s   0.01s   904367.368   902906.876
200   3   1     1   0.00s     0.15s   0.01s   904361.784   906107.149
200   3   1     1   0.00s     0.15s   0.01s   904361.784   903307.763
There is no material change from the numbers listed above for 600MB allocations without prezeroing. The numbers for 200Mb improve by 5-8%.
 Mb Rep Thr CLine    User    System    Wall    flt/cpu/s   fault/wsec
200   3   1     1   0.00s     0.17s   0.01s   848750.083   847902.050
200   3   1     1   0.00s     0.17s   0.01s   853466.095   853565.693
200   3   1     1   0.00s     0.17s   0.01s   848750.083   849595.115
200   3   1     1   0.00s     0.17s   0.01s   848754.773   848628.162
200   3   1     1   0.00s     0.17s   0.01s   844090.542   844554.657
200   3   1     1   0.00s     0.17s   0.01s   848754.773   848590.654
200   3   1     1   0.00s     0.17s   0.01s   848750.083   848792.295
200   3   1     1   0.00s     0.17s   0.01s   844090.542   846406.904
600 MB allocation
 Mb Rep Thr CLine    User    System    Wall    flt/cpu/s   fault/wsec
600   3   1     1   0.00s     0.16s   0.01s   856766.858   854672.330
600   3   1     1   0.00s     0.16s   0.01s   851784.873   851057.402
600   3   1     1   0.00s     0.16s   0.01s   856766.858   853173.979
No impact in single user mode.
LMBench output may be found here. The test shows no significant performance regressions. Some system operations are accelerated for page zeroing.
The LMBench output is available here
The data shows that the performance of all tests stays essentially the same. There seems to be a minor slowdown in some tests for prezeroing and some minor improvements for atomic page table operations.
Description of the implementation of multiple page support in HP-UX.
Suggestions on how to use the features of the Itanium MMU for Linux.
Navarro develops methodologies to elevate single pages to superpages in order to reduce TLB use.