SGI
Open Source
Lockmeter FAQ

What is lock metering?
What data does the kernel capture?
What does lock metering say about 2.3.x kernels?
What does lock metering say about 2.4.x kernels?
What are Lockmeter's known bugs or limitations?


What is lock metering?

Lock metering is the kernel's capturing of data, at runtime, about its own usage of spinlocks. By "spinlocks" we mean both simple spinlock_t spinlocks and rwlock_t single-writer/multiple-reader locks.

What data does the kernel capture?

When lock metering is compiled into the kernel and then turned on, the kernel records the following for each combination of spinlock, caller, and CPU: the number of times the lock was acquired; whether each acquisition succeeded immediately or had to wait for the current owner to release the lock; the cumulative and maximum time the lock was held; and the cumulative and maximum time callers waited for the lock to be released.
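The data listed above can be pictured as one small record per (lock, caller, CPU) combination. The following is only a sketch: the struct, its field names, and the lock_stats_record() helper are hypothetical and are not taken from the Lockmeter source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for one (lock, caller, CPU) combination.
 * Field names are illustrative, not Lockmeter's actual layout. */
struct lock_stats {
    uint64_t acquisitions;   /* total times the lock was acquired */
    uint64_t contended;      /* acquisitions that had to wait */
    uint64_t hold_total_ns;  /* cumulative hold time */
    uint64_t hold_max_ns;    /* longest single hold */
    uint64_t wait_total_ns;  /* cumulative wait time */
    uint64_t wait_max_ns;    /* longest single wait */
};

/* Fold one completed acquire/release cycle into the record.
 * wait_ns == 0 means the lock was obtained immediately. */
static void lock_stats_record(struct lock_stats *s,
                              uint64_t wait_ns, uint64_t hold_ns)
{
    assert(s != NULL);
    s->acquisitions++;
    if (wait_ns > 0) {
        s->contended++;
        s->wait_total_ns += wait_ns;
        if (wait_ns > s->wait_max_ns)
            s->wait_max_ns = wait_ns;
    }
    s->hold_total_ns += hold_ns;
    if (hold_ns > s->hold_max_ns)
        s->hold_max_ns = hold_ns;
}
```

Mean hold and wait times can then be derived by dividing the cumulative totals by the acquisition and contention counts.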

What does lock metering say about 2.3.x kernels?

Using one particular multiprocess workload (specifically, AIM7 without the three synchronous disk subtests, which produces a compute-bound workload with about 75% user and 25% system time) running on a 4xCPU 500MHz Xeon SMP, the 2.3.28 kernel performed about 3% faster than the 2.2.13 kernel at the highest loads. The 2.3.28 kernel performed almost twice as many spinlock acquisitions per second as the 2.2.13 kernel, due to the finer granularity of 2.3's locking schemes, but 2.3 exhibited contention on only 2% of those spinlock calls, vs. 16% in 2.2. When 2.3 did contend, the mean wait-times were almost 2x those in 2.2.

With this workload on this 4xCPU hardware, spinlock contention in 2.2.13 consumed about 8% of theoretically available CPU cycles (325 milliseconds/second of waiting per 4,000 milliseconds/second theoretically available) vs. about 4% in 2.3.28 (160 milliseconds of waiting per 4,000 milliseconds). In other words, for this workload, 2.3.28 spends about half as many cycles waiting on spinlocks as 2.2.13 does.
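The percentages above come from dividing the total spinlock wait time per second by the CPU time available per second, where each CPU contributes 1,000 milliseconds. A minimal sketch of that arithmetic follows; the function name spin_wait_percent() is made up for illustration.

```c
#include <assert.h>

/* Percentage of available CPU time spent waiting on spinlocks.
 * wait_ms_per_sec: wait time summed over all CPUs, per wall-clock second.
 * ncpus: each CPU contributes 1000 ms of CPU time per second. */
static double spin_wait_percent(double wait_ms_per_sec, int ncpus)
{
    assert(ncpus > 0);
    return 100.0 * wait_ms_per_sec / (1000.0 * ncpus);
}
```

With the figures quoted above, spin_wait_percent(325, 4) gives 8.125 for 2.2.13 and spin_wait_percent(160, 4) gives 4.0 for 2.3.28.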

Usage of kernel_flag is still significant in 2.3, but contention on it is greatly reduced. This workload saw contention in 2.2.13 on 42% of the 48K/second kernel_flag acquisitions, vs. 12% of the 19K/second acquisitions done in the 2.3.28 kernel.

In 2.3.28 the biggest hold-time hotspot is in do_close(), holding kernel_flag for a mean of 59usec and a max of 15msec.

What does lock metering say about 2.4.x kernels?

Lock contention bottlenecks continue to lessen in 2.4.x.

A cooperative project, the Linux Scalability Effort, has been formed to focus on large-system performance issues. One work-in-progress focuses on the CPU scheduler. The current 2.4.x scheduler uses a single runqueue and a common global runqueue_lock that typically sees high contention in environments characterized by high context-switching rates. Out of that effort has come the Multiple Runqueue patch, which replaces the single runqueue with per-CPU runqueues protected by per-CPU locks.
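The idea behind per-CPU runqueues can be illustrated with a small user-space sketch. This is not the Multiple Runqueue patch: the toy_spinlock type, the toy_runqueue struct, and the function names are all hypothetical, and a C11 atomic_flag stands in for the kernel's spinlock_t.

```c
#include <assert.h>
#include <stdatomic.h>

#define NCPUS 4   /* illustrative CPU count */

/* Minimal test-and-set spinlock, standing in for spinlock_t. */
typedef struct { atomic_flag held; } toy_spinlock;

static void toy_lock(toy_spinlock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;   /* spin until the current owner releases the lock */
}

static void toy_unlock(toy_spinlock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* One runqueue per CPU, each protected by its own lock. */
struct toy_runqueue {
    toy_spinlock lock;
    int nr_running;   /* runnable tasks queued on this CPU */
};

static struct toy_runqueue runqueues[NCPUS];

static void runqueues_init(void)
{
    for (int cpu = 0; cpu < NCPUS; cpu++) {
        atomic_flag_clear(&runqueues[cpu].lock.held);
        runqueues[cpu].nr_running = 0;
    }
}

/* Queue a task on one CPU. Only that CPU's lock is taken, so CPUs
 * working on different queues never contend with each other. */
static void enqueue_task(int cpu)
{
    assert(cpu >= 0 && cpu < NCPUS);
    struct toy_runqueue *rq = &runqueues[cpu];
    toy_lock(&rq->lock);
    rq->nr_running++;
    toy_unlock(&rq->lock);
}
```

With a single global lock, every enqueue on every CPU would serialize on the same lock; here the contention is confined to users of the same queue.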

Other workloads, like AIM7, have identified the pagecache_lock as the major bottleneck in ext2 filesystem-intensive workload environments. Ingo Molnar maintains an effective patch to eliminate this bottleneck. AIM7 on a mips64 Origin2000 platform running the baseline 2.4.2 kernel shows that performance begins to flatten out above four CPUs, and at 32 CPUs the performance is only about 6x the 1-CPU performance. With that pagecache_lock patch, the same workload scales much more linearly up to about 28 CPUs, reaching about 19x the 1-CPU performance. At that level, Lockmeter identifies two more bottlenecks: the kernel_flag, especially as used by ext2_get_block(); and various waits on the pagemap_lru_lock.

What are Lockmeter's known bugs or limitations?

The most visible Lockmeter shortcoming is the rapid filling of a data structure that records read-lock data. The lockstat command reports this as an overflow. No data gets corrupted; some read-lock data simply gets thrown away. The overflow occurs because the table fills with entries for rwlock_t locks that were dynamically created and have since been destroyed.

A fix for this problem is being worked on. Meanwhile, one workaround is to increase the value of the constant LSTAT_MAX_READ_LOCK_INDEX in the file include/linux/lockmeter.h. Feel free to change the value from 1000 to 2000, or even higher, to see if data from your particular workload can be contained in a larger table. You cannot increase this constant indefinitely, of course, since the data structure is allocated from a limited amount of kmalloc() space.
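To show why a fixed-size table overflows and why a larger constant helps, here is a rough user-space sketch. Only the constant name LSTAT_MAX_READ_LOCK_INDEX and the values 1000 and 2000 come from this FAQ; the #define form, the slot struct, and the lookup function are assumptions made for illustration and do not reflect the real contents of lockmeter.h.

```c
#include <assert.h>
#include <stddef.h>

/* Constant named in the FAQ, raised from 1000 to 2000.
 * Everything else below is a hypothetical table layout. */
#define LSTAT_MAX_READ_LOCK_INDEX 2000

struct read_lock_slot {
    const void *lock_addr;            /* which rwlock_t this slot describes */
    unsigned long read_acquisitions;  /* per-lock read-lock count */
};

static struct read_lock_slot read_lock_table[LSTAT_MAX_READ_LOCK_INDEX];
static int slots_used;
static unsigned long overflow_count;  /* samples dropped: table was full */

/* Return the slot for a lock, claiming a new one on first sight.
 * Slots are never reclaimed, so destroyed locks keep occupying them.
 * Returns NULL and counts an overflow once the table is full. */
static struct read_lock_slot *read_lock_slot_for(const void *lock_addr)
{
    for (int i = 0; i < slots_used; i++)
        if (read_lock_table[i].lock_addr == lock_addr)
            return &read_lock_table[i];
    if (slots_used == LSTAT_MAX_READ_LOCK_INDEX) {
        overflow_count++;
        return NULL;
    }
    read_lock_table[slots_used].lock_addr = lock_addr;
    return &read_lock_table[slots_used++];
}
```

In this sketch, data already in the table stays intact after an overflow; only samples for locks that could not get a slot are discarded, which matches the behavior described above.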

Another limitation is Lockmeter's performance cost. Ideally, any runtime instrumentation should have zero cost and therefore have no effect on the behavior of the system you endeavor to measure. In the real world, of course, built-in instrumentation invariably perturbs the system being measured. Linux spinlocks are usually implemented in short, efficient assembler sequences of a few inline instructions. Lockmetered spinlocks are implemented as procedure calls that are invariably slower than a few inline instructions, even when Lockmeter data gathering is turned off. However, care has been taken to minimize the performance impact of Lockmetering, and additional work is in progress to reduce the Lockmeter overhead even further. In general, you can build the Lockmeter functionality into the kernel, leave it turned off until a workload appears that you wish to instrument, and suffer a degradation of only a few percent in kernel performance.
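The difference between an inline lock and a metered procedure call can be sketched in user-space C. This is an illustration only, not Lockmeter's implementation: demo_lock, metering_enabled, metered_lock(), and the two counters are hypothetical names, and a C11 atomic_flag stands in for the kernel's assembler sequence.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_flag held; } demo_lock;

/* Hypothetical on/off switch and counters. A real implementation
 * would keep its counters per lock, per caller, and per CPU. */
static bool metering_enabled = false;
static atomic_ulong metered_acquisitions;
static atomic_ulong metered_contended;

/* Unmetered fast path: the kind of short sequence that is inlined. */
static inline bool demo_trylock(demo_lock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

static inline void demo_unlock(demo_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Metered entry point: a function call plus a test of the switch.
 * That cost is paid on every acquisition, even with metering off. */
static void metered_lock(demo_lock *l)
{
    bool waited = false;
    while (!demo_trylock(l))
        waited = true;   /* had to spin at least once */
    if (metering_enabled) {
        atomic_fetch_add_explicit(&metered_acquisitions, 1, memory_order_relaxed);
        if (waited)
            atomic_fetch_add_explicit(&metered_contended, 1, memory_order_relaxed);
    }
}
```

When metering_enabled is false, the counters are never touched, but the call and the branch remain; that residual cost is the kind of overhead described above.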