             Linux Comprehensive System Accounting (CSA)
                       Kernel Design Document

                           Marlys Kohnke
                            May 4, 2000
                     Revised November 10, 2000
                       Revised March 7, 2001

                              Jay Lan
                       Revised June 29, 2004


1.0 Introduction
-----------------

Linux has GNU process accounting, which is similar to SysV and IRIX
basic process accounting.  The set of Linux GNU accounting commands
differs from SysV/IRIX.  There is no disk accounting.  The Linux kernel
itself monitors the size of the system accounting file (pacct file) and
takes appropriate action.  SysV/IRIX does this outside of the kernel.

Linux CSA is similar to IRIX CSA (which is very similar to Cray
accounting).  Linux CSA provides a set of user and administrator
commands which process raw accounting data and combine this data into
jobs within system uptime periods.  More resource usage counters have
been added to Linux.

This document describes the CSA kernel infrastructure and the CSA job
accounting kernel code (which can be compiled into the kernel or
compiled as a loadable module).

The Linux kernels currently supported are linux-2.4 and linux-2.6.
Patches available for download are:

    Linux 2.4:
        linux-2.4.25.csa.patch
        linux-2.4.26.csa.patch
    Linux 2.6:
        linux-2.6.5.csa.patch
        linux-2.6.6.csa.patch
        linux-2.6.7.csa.patch


2.0 Current Linux Design
-------------------------

Currently, Linux has GNU process accounting, which provides the
following per-process resource usage information:

    . cpu user time
    . cpu system time
    . memory integral
    . i/o characters transferred
    . blocks read and written
    . minor pagefaults
    . major pagefaults
    . swaps (not supported since Linux 2.6.6)
    . exit status

This is similar to the IRIX basic process accounting.  However, the
memory integral (memory use over cpu time intervals) on Linux GNU is a
one-time snapshot of the amount of memory used at process exit.  The
characters and blocks read and written are hardcoded to zero.  The
other information is gathered from the current task struct.
Note that "page swapped" info was removed from the task struct in
Linux 2.6.6.

This usage information is written in a 16-bit compressed format (a
3-bit base-8 exponent and a 13-bit fraction).  The cpu usage time is in
clock ticks.  The memory value is in 1K units.


3.0 New Linux CSA Design
------------------------

The kernel/csa.c code can be compiled directly into the kernel or
compiled as a loadable module.  This code contains the procedures which
process CSA configuration and status requests, requests to write CSA
binary records, and /proc ioctl requests.

3.1 PAGG Job Interface
----------------------

The csa.c code also registers procedure callback names with the PAGG
job code.  Those callbacks are for writing start-of-job and end-of-job
accounting records.  The PAGG job code (either builtin or loaded as a
module) must be available for this registration to succeed.  So, if the
PAGG job code and the CSA code are both compiled as modules, the PAGG
job module must be loaded before the CSA module.

The csa_jstart() and csa_jexit() procedures will be called from within
the PAGG job code.  These two procedures fill in and write start-of-job
and end-of-job accounting records.

3.2 CSA Job Accounting
----------------------

A configuration option, CONFIG_CSA or CONFIG_CSA_MODULE, controls
whether CSA job accounting is compiled into the kernel.  If job
accounting is not enabled, the following procedure definitions will be
used:

    #define csa_update_integrals()       do { } while (0)
    #define csa_clear_integrals(task)    do { } while (0)
    #define csa_acct(exitcode, task)     do { } while (0)

If job accounting is enabled, the csa_update_integrals() inline
procedure will calculate the physical and virtual memory usage over
time integrals and store the values in the current task structure.

The csa_acct() procedure will be called from within do_exit().  This
procedure is a wrapper for a procedure in the CSA job accounting code,
which does the work of filling in and writing an accounting record.
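
The memory-over-time integrals described above can be pictured with a
small user-space sketch.  It is not the kernel code: the structure and
function names are hypothetical, the units are left abstract, and it
assumes (based on the csa_stimexpd field listed in section 4.3) that
the interval is measured as the system time accumulated since the
previous calculation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the accounting fields kept per task. */
struct acct_sketch {
    uint64_t rss_mem1;   /* physical memory integral */
    uint64_t vm_mem1;    /* virtual memory integral */
    uint64_t stimexpd;   /* system time already folded into the integrals */
};

/*
 * Fold the interval since the last update into both integrals.
 *   stime_now: cumulative system time of the task (abstract time unit)
 *   rss, vm:   memory in use during that interval (abstract memory unit)
 * Both integrals are updated together, whichever kind of memory changed.
 */
static void update_integrals_sketch(struct acct_sketch *a, uint64_t stime_now,
                                    uint64_t rss, uint64_t vm)
{
    assert(stime_now >= a->stimexpd);
    uint64_t delta = stime_now - a->stimexpd;

    a->rss_mem1 += delta * rss;
    a->vm_mem1 += delta * vm;
    a->stimexpd = stime_now;
}
```

A caller would invoke this just before a memory size change takes
effect, passing the sizes that were in use during the interval that is
ending.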

3.3 Writing Accounting Records
------------------------------

As noted in an earlier section, the end-of-process accounting record
information is gathered from task struct fields.  A task struct pointer
is passed from the do_exit() code.

The following conditions will prevent an accounting record from being
written:

1) If the process is not part of a job, no record is written.

2) If CSA is configured off, no record is written.  The csa record type
   can be enabled while the memory or i/o record types are disabled.
   That limits which continuation records are written.

3) CSA can be configured with threshold values for cpu time and amount
   of virtual memory used.  If the process resource usage for either of
   these resources is less than the threshold, then an accounting
   record is not written.

4) If the daemon accounting type is not enabled, the daemon accounting
   record is not written.

However, if user job accounting for a specific job has been enabled
(via the ja command), accounting records are always written to the user
job accounting file regardless of the CSA configuration.

When a new job is created, the PAGG job code passes information to the
CSA code which includes the job id, user id and job start time.  This
information is used to write a start-of-job accounting record.

When the last process within a job exits, the PAGG job code sends
information so that an end-of-job accounting record can be written.
This information includes the job id, user id, group id, nice value,
and job start and end times.  In the future, the job physical and
virtual highwater memory values will also be passed from the PAGG code
(the highest value of any one process within the job).

3.4 Ioctl Requests Processing
-----------------------------

The CSA kernel functions are accessed by user-space processes via a
library interface, which currently interfaces to the kernel through
/proc ioctl calls.
The /proc ioctl requests provide functions to check the status of CSA;
start and stop process, daemon and record accounting; start and stop
user job accounting (the CSA ja command); and write daemon accounting
records.  The record accounting types are memory and I/O.  The kernel
and daemon accounting types supported are csa, nqs, tape and workload
management.

The CAP_SYS_PACCT capability is required to start and stop CSA and to
write daemon accounting records.

3.5 Semaphores
--------------

The kernel code uses two semaphores.  One semaphore is used around
checking the CSA configuration before writing an accounting record and
when changing the CSA configuration.  The second semaphore is used when
closing or changing the system pacct file (CSA binary records file) or
a user job accounting file and when writing an accounting record.


4.0 Accounting Record Structures
--------------------------------

In order to accommodate processing accounting records from a cluster of
32-bit and 64-bit machines, the accounting record structures will be
kept the same size across these machines.
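
One common way to keep a record the same size on 32-bit and 64-bit
machines is to build it only from fixed-width integer types and to
check the size at compile time.  The sketch below illustrates the idea
with a hypothetical record; it is not one of the CSA structures, and
the expected size of 24 bytes applies to this example layout only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical record built only from fixed-width types.  A field
 * declared "unsigned long" would instead be 4 bytes on a 32-bit machine
 * and 8 bytes on a typical 64-bit Linux machine.
 */
struct fixed_rec_sketch {
    uint32_t uid;            /* offset 0 */
    uint32_t gid;            /* offset 4 */
    uint64_t jid;            /* offset 8 */
    uint64_t elapsed_usec;   /* offset 16 */
};

/* Fail the build if the layout ever stops matching the expected one. */
_Static_assert(sizeof(struct fixed_rec_sketch) == 24,
               "record size must not depend on the machine word size");
_Static_assert(offsetof(struct fixed_rec_sketch, jid) == 8,
               "64-bit field must start on an 8-byte boundary");
```

Placing the two 32-bit fields together ahead of the 64-bit fields keeps
the 64-bit fields naturally aligned, so no compiler padding is needed
on either kind of machine.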

4.1 Base Record
---------------

Linux CSA base records have the following fields:

    struct achead  ac_hdr1;      /* Header */
    struct achead  ac_hdr2;      /* 2nd header for continued records */
    double         ac_sbu;       /* System billing units */
    unsigned int   ac_stat:8;    /* Exit status */
    unsigned int   ac_nice:8;    /* Nice value */
    unsigned char  ac_sched;     /* Scheduling discipline */
    unsigned int   :8;           /* Unused */
    uint32_t       ac_uid;       /* User ID */
    uint32_t       ac_gid;       /* Group ID */
    uint64_t       ac_ash;       /* Array session handle; unused */
    uint64_t       ac_jid;       /* Job ID */
    uint64_t       ac_prid;      /* Project ID; unused */
    uint32_t       ac_pid;       /* Process ID */
    uint32_t       ac_ppid;      /* Parent process ID */
    time_t         ac_btime;     /* Beginning time [sec since 1970] */
    char           ac_comm[16];  /* Command name */
    uint64_t       ac_etime;     /* Elapsed time [usecs] */
    uint64_t       ac_utime;     /* User CPU time [usec] */
    uint64_t       ac_stime;     /* System CPU time [usec] */
    uint64_t       ac_spare;     /* Spare field */
    uint64_t       ac_spare1;    /* Spare field */

The values for most of these fields will come from the current task
struct.

The ac_sched value will be current->policy.  That policy field is an
unsigned long, but there are only three scheduling policies defined
now.  An unsigned char field for ac_sched should be sufficient.
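
The time fields above are in microseconds, while section 2.0 notes
that the kernel keeps cpu usage time in clock ticks, so a conversion is
needed when a record is filled in.  The following is a minimal sketch
of such a conversion, not the CSA code: the function name is
hypothetical, and the tick rate is passed in as a parameter because it
depends on the kernel configuration.

```c
#include <assert.h>
#include <stdint.h>

#define USEC_PER_SEC_SKETCH UINT64_C(1000000)

/*
 * Convert a count of clock ticks to microseconds for a tick rate of
 * hz ticks per second.  Whole seconds and the remainder are converted
 * separately so the multiplication cannot overflow for large counts.
 */
static uint64_t ticks_to_usec_sketch(uint64_t ticks, uint32_t hz)
{
    assert(hz > 0);
    uint64_t whole_sec = ticks / hz;
    uint64_t rem_ticks = ticks % hz;

    return whole_sec * USEC_PER_SEC_SKETCH
           + (rem_ticks * USEC_PER_SEC_SKETCH) / hz;
}
```

With a tick rate of 100, for example, 250 ticks come out as 2500000
microseconds.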

4.2 I/O Record
--------------

Linux CSA I/O continuation records will have the following fields:

    struct achead  ac_hdr;       /* Header */
    double         ac_sbu;       /* System billing units */
    uint64_t       ac_bwtime;    /* Block I/O wait time [usecs] */
    uint64_t       ac_rwtime;    /* Raw I/O wait time [usecs]; unused */
    uint64_t       ac_chr;       /* Number of chars (bytes) read */
    uint64_t       ac_chw;       /* Number of chars (bytes) written */
    uint64_t       ac_bkr;       /* Number of blocks read */
    uint64_t       ac_bkw;       /* Number of blocks written */
    uint64_t       ac_scr;       /* Number of read system calls */
    uint64_t       ac_scw;       /* Number of write system calls */
    uint64_t       ac_spare;     /* Spare field */

These fields are added to the task struct and initialized to zero:

    unsigned long int rchar, wchar, rblk, wblk, syscr, syscw, bwtime

There are many file systems ported to Linux.  Rather than add counters
to each file system, the counters will be in filesystem-independent
parts of the kernel (either before or after the filesystem-specific
code).  This will also automatically handle new file systems added
later.

Async I/O is currently not available in Linux.  If this is provided
some day, counters will need to be added in that code.

The stock kernel has no counters that provide this i/o information.
Counters have been added in the following places:

1) ac_bwtime: This field will hold the amount of block i/o wait time in
   microseconds.  The current process will be charged for all block
   write activity and wait time when cached blocks are written to disk,
   even though not all of those blocks necessarily belong to that
   process.

       drivers/block/ll_rw_blk.c: increment current->bwtime in clock
           ticks for wait time (this will be converted to microseconds
           when the record is written); incremented in
           __get_request_wait()

2) ac_rwtime: raw i/o wait time; deferred; currently not applicable

3) ac_chr: This field will hold the number of bytes (characters) read
   via read, pread and readv syscalls.
   Put the counter in each syscall since there isn't a common lower
   level routine for the varying file systems until down to the device
   level.

       linux/fs/read_write.c: add ret bytes to current->rchar in
           sys_read(), sys_pread() and sys_readv()

4) ac_chw: This field will hold the number of bytes (characters)
   written via write, pwrite and writev syscalls.  Handled similarly
   to #3.

       linux/fs/read_write.c: add ret bytes to current->wchar in
           sys_write(), sys_pwrite() and sys_writev()

5) ac_bkr: This field will hold the number of blocks read by this
   process.  This counter is added to code common to all file systems.
   This value will be converted to 4K blocks when the accounting record
   is written.

       linux/drivers/block/ll_rw_blk.c: add nr_sectors to current->rblk
           in drive_stat_acct()

6) ac_bkw: This field will hold the number of blocks written while this
   process was the current executing process.  Because of caching, this
   number of blocks can, and likely will, include blocks not associated
   with the current process.  The documentation needs to make clear
   that the number of blocks written doesn't necessarily correlate
   directly with the process' own I/O operations.

       linux/drivers/block/ll_rw_blk.c: add nr_sectors to current->wblk
           in drive_stat_acct()

7) ac_scr: This field will hold the number of read system calls,
   including read, pread, and readv.  Add a counter to each syscall
   code.

       linux/fs/read_write.c: Increment current->syscr in sys_read(),
           sys_readv(), sys_pread()

8) ac_scw: This field will hold the number of write system calls,
   including write, pwrite, and writev.  Add a counter to each syscall
   code.

       linux/fs/read_write.c: Increment current->syscw in sys_write(),
           sys_writev(), sys_pwrite()

4.3 Memory Record
-----------------

Linux CSA memory continuation records have the following fields:

    struct achead  ac_hdr;       /* Header */
    double         ac_sbu;       /* System billing units */
    struct memint  ac_core;      /* Core memory integrals */
    struct memint  ac_virt;      /* Virtual memory integrals */
    uint64_t       ac_pgswap;    /* Number of pages swapped */
    uint64_t       ac_minflt;    /* Number of minor page faults */
    uint64_t       ac_majflt;    /* Number of major page faults */
    uint64_t       ac_spare;     /* Spare field */

struct memint:

    uint64_t  himem;  /* High-water memory usage value [Kbytes] */
    uint64_t  mem1;   /* Memory integral 1 [Mbytes/uSec] */
    uint64_t  mem2;   /* Memory integral 2 [Mbytes/uSec]; unused */
    uint64_t  mem3;   /* Memory integral 3 [Mbytes/uSec]; unused */

These fields are added to the task struct and initialized to zero when
the configuration option CONFIG_CSA or CONFIG_CSA_MODULE is defined:

    unsigned long int csa_rss_mem1, csa_vm_mem1;  /* memory integrals */
    clock_t csa_stimexpd;  /* sum of stimes at last integral calc */

These fields are added to the mm_struct and initialized to zero:

    unsigned long hiwater_rss, hiwater_vm;

There are currently kernel counters for the number of page swaps and
number of minor and major page faults.  The highwater counters for
physical and virtual memory have been added.  The two fields being
added to the task structure are used to calculate the memory integral.
Both memory integrals will be calculated whenever either physical or
virtual memory changes happen.

1) ac_core.himem: This field holds the physical memory highwater value.
   Check this value and reset it when physical memory increases.

   ac_core.mem1: This field holds the physical memory integral value.
   Calculate this value when physical or virtual memory changes.

   ac_virt.himem: This field holds the virtual memory highwater value.
   Check this value and reset it when virtual memory increases.

   ac_virt.mem1: This field holds the virtual memory integral value.
   Calculate this value when virtual or physical memory changes.

       include/linux/mm.h (expand_stack)
       linux/kernel/exec.c (do_execve)
       linux/kernel/exit.c (do_exit)
       linux/mm/memory.c (zap_page_range, do_swap_page, do_no_page,
           do_wp_page, do_anonymous_page)
       linux/mm/vmscan.c (try_to_swap_out)
       linux/mm/swapfile.c (unuse_pte)
       linux/mm/mmap.c (do_mmap_pgoff, do_brk)
       linux/mm/mremap.c (move_vma, do_mremap)

4.4 Configuration Record
-------------------------

A CSA configuration accounting record is written whenever the CSA
configuration is changed.  This record contains the following fields:

    struct achead      ac_hdr;        /* Header */
    unsigned int       ac_kdmask;     /* Kernel and daemon config mask */
    unsigned int       ac_rmask;      /* Record configuration mask */
    int64_t            ac_uptimelen;  /* Bytes from the end of the boot
                                         record to the next boot record */
    ac_eventtype       ac_event:8;    /* Accounting configuration event */
    unsigned int       :24;           /* Unused */
    time_t             ac_boottime;   /* System boot time [secs since 1970] */
    time_t             ac_curtime;    /* Current time [secs since 1970] */
    struct ac_utsname  ac_uname;      /* Condensed uname information */

The configuration event types are:

    AC_CONFCHG_BOOT,       /* Boot time (always first) */
    AC_CONFCHG_FILE,       /* Reporting pacct file change */
    AC_CONFCHG_ON,         /* Reporting xxx ON */
    AC_CONFCHG_OFF,        /* Reporting xxx OFF */
    AC_CONFCHG_INC_DELTA,  /* Report incremental acct clock delta change */
    AC_CONFCHG_INC_EVENT,  /* Report incremental accounting event */
    AC_CONFCHG_MAX
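
The event list above reads like the body of a C enumeration whose last
entry, AC_CONFCHG_MAX, serves as an upper bound.  The sketch below
shows how a record reader might use that bound to validate an event
value before printing it.  It rests on assumptions the list does not
state: that the enumerators are numbered consecutively from zero in the
order shown, and that the type and function names (suffixed _sketch)
and the printed strings are purely illustrative.

```c
#include <assert.h>
#include <string.h>

/* Assumed declaration: consecutive values starting at zero. */
typedef enum {
    AC_CONFCHG_BOOT,
    AC_CONFCHG_FILE,
    AC_CONFCHG_ON,
    AC_CONFCHG_OFF,
    AC_CONFCHG_INC_DELTA,
    AC_CONFCHG_INC_EVENT,
    AC_CONFCHG_MAX          /* not an event; one past the last valid value */
} ac_eventtype_sketch;

/* Map an event value read from a record to a printable name. */
static const char *event_name_sketch(unsigned int ev)
{
    static const char *const names[AC_CONFCHG_MAX] = {
        [AC_CONFCHG_BOOT]      = "boot",
        [AC_CONFCHG_FILE]      = "pacct file change",
        [AC_CONFCHG_ON]        = "accounting on",
        [AC_CONFCHG_OFF]       = "accounting off",
        [AC_CONFCHG_INC_DELTA] = "incremental clock delta change",
        [AC_CONFCHG_INC_EVENT] = "incremental accounting event",
    };

    /* Anything at or beyond AC_CONFCHG_MAX is not a known event. */
    return ev < AC_CONFCHG_MAX ? names[ev] : "unknown";
}
```

Sizing the table with AC_CONFCHG_MAX means a newly added event type
enlarges the table automatically, and the range check keeps a corrupt
record from indexing past its end.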