Process Scheduling and Memory Placement - CpuMemSet Design Notes
Revised and released again to the public domain, except for portions marked subject to the terms and conditions of the GNU General Public License, by Paul Jackson, SGI, 8 October 2001, 31 October 2001, 14 November 2001, 21 December 2001, and 18 June 2002.
Objective
CpuMemSets provide an API and implementation for Linux that will support the Processor and Memory Placement semantics required for optimum performance on Linux NUMA systems. These Notes describe the kernel system mechanisms and interfaces appropriate to support process memory placement and CPU scheduling affinity. This is the capability to place task scheduling and memory allocation onto specific processors and memory blocks and to select and specify the operating parameters of certain related kernel task scheduling and memory allocation policies.
Attempts to put cpusets, nodesets, topologies, various scheduling policies, quads, and such into the kernel are typically trying to place too much complexity, or too much policy, or perhaps excessively vendor-specific details, into the kernel. The cpus_allowed bit vector (added by Ingo for the Tux kernel-based web server, and used in the task scheduling loop to control which CPUs a task can run on) is a step in the right direction, but exposes an application to the specific details of which system CPU numbers it is running on, which is not quite abstract or virtual enough. In particular, the application-focused logic to use certain application CPUs or memories assigned to it in various ways (run this task here, put that shared memory region there) is confused with the system administrator's decision to assign a set of resources (CPU and memory) to a job. Also, cpus_allowed is limited to 32 or 64 CPUs, and will need to change for larger systems. Such a change should not impact the primary kernel API used to manage CPU and memory placement for NUMA and other large systems.
There are several other proposals and implementations addressing these same needs. Some of these proposals have a tendency, in this author's view, to attempt to address too many of the above concerns directly in the kernel. This proposal adds some additional structure, with some generic and flexible interfaces designed to separate and isolate the diverse and conflicting demands on the design, so that, for example, the requirements for hot swapping CPUs don't impact the application API, and the requirements to support existing legacy APIs don't impact the details of critical allocation and scheduling code in the kernel. This implementation proposes to add two layers, cpumemmaps and cpumemsets, as described in the cpumemsets.h header file later in this document.
If you find that the current CpuMemSets API is better suited for expressing your application's processor and memory placement needs than anything else available, good. But if you find the API to be too cumbersome and primitive or otherwise ill suited for convenient use by your application, then find or develop a decent library and API that is easier to use in your circumstances. Hopefully, that library and API will depend on the CpuMemSets API and kernel mechanism.
By default, this Map includes all CPUs and memory blocks, and this Set allows scheduling on all CPUs and allocation on all blocks. A hook is provided to allow an architecture specific routine to initialize this Map and Set. This hook could be used to properly sort the kernel cpumemset memory lists so that initial kernel data structures are allocated on the desired nodes. An optional kernel boot parameter causes this initial Map and Set to include only one CPU and one memory block, in case the administrator or some system service will be managing the remaining CPUs and blocks in some specific way. This boot parameter is provided to the above hook for the use of the architecture specific initialization routine.
As soon as the system has booted far enough to run the first user process, init(1M), an early init script may be invoked that examines the topology and metrics of the system, and establishes optimized cpumemmap and cpumemset settings for the kernel and for init. Prior to that, various kernel daemons are started and kernel data structures allocated, which may allocate memory without the benefit of these optimized settings. This reduces the amount of knowledge that the kernel need have of the special topology and distance attributes of a system, in that the kernel need only know enough to get early allocations placed correctly. More esoteric topology awareness can be kept in userland.
System administrators and services with root privileges manage the initial allocation of system CPUs and memory blocks to cpumemmaps, deciding which applications will be allowed the use of which CPUs and memory blocks. They also manage the cpumemset for the kernel, which specifies in what order to search for kernel memory, depending on which CPU is executing the request. For an optimal system, the cpumemset for the kernel should probably sort the memory lists for each CPU by distance from that CPU.
Almost all ordinary applications will be unaware of CpuMemSets, running on whatever CPUs and memory blocks their inherited cpumemmap and cpumemset dictate. But major multi-processor applications can take advantage of CpuMemSets, probably via existing legacy APIs, to control the placement of the various processes and memory regions that the application manages. Emulators for whatever API the application is using will convert these requests into cpumemset changes, which will provide the application with detailed control of the CPUs and memory blocks provided to it by its cpumemmap. On systems supporting hot-swap of CPUs (or even memory, if someone can figure that out), the system administrator would be able to change CPUs and remap, by changing the application's cpumemmap, without the application being aware of the change.
The role of an Application Architect in this regard is to specify, for a given application, the details of just which of the CPUs and memory available to it are used to schedule which tasks and to allocate which memory. The System Manager is managing a particular physical computer system, preferring to remain relatively oblivious to the innards of applications, and the Application Architect is managing the details of task and memory usage within a single application, preferring to ignore the details of the particular system being used to execute the application. The System Manager does not usually care whether the application puts two particular threads on the same CPU or on different CPUs, and the Application Architect does not care whether that CPU is number 9 or number 99 in the system.
We cover some basic notions of distance here, and anticipate that Dorwin's work, done in concert with CpuMemSets, will support these notions. We cover this here, even though it is not part of CpuMemSets, because it is usually involved in attempts to solve these needs, and we want to be clear that we recognize its importance, even though it is separate from this design.
The kernel provides information, via /proc, of the number of CPUs and memory blocks, and of the distance between them, so that sufficiently intelligent system administrators and services can assign "closely" placed CPUs and memory blocks (perhaps all on the same node or quad) to the same cpumemset, for optimal performance. But the kernel has no notion (for the purpose of CpuMemSets) of topology, nodes or quads, with the possible exception of architecture specific code that sets up the initial kernel cpumemmap and cpumemset. Nor does the kernel task scheduler or memory allocation code pay any attention to this distance, with the possible exception of more dynamic scheduler or allocator mechanisms, distinct from CpuMemSets. The kernel just reports topology and distances to the user code.
Processors are separate, parallel scheduled, general purpose execution units. Memory blocks are partition classes of physical general purpose system RAM, such that any two distinct locations within the same block are the same distance from all processors, and for any two separate blocks, there is typically at least one processor such that the two blocks are at a different distance from that processor. The distance from a given processor to a given memory block is a scalar approximation of that memory's latency and bandwidth, when accessed from that processor. The longer the latency and the lower the bandwidth, the higher the distance. For Intel IA64 systems, we expect to make use of the ACPI support for distances, and to use a distance metric that is scaled to make the closest <processor, memory> pair be at a distance of 10 from each other.
Not all the processing or memory elements visible on the system bus are general purpose. There may be I/O buffer memory, DMA engines, vector processors and frame buffers. We might care about the distance from any processing element, whether a general purpose CPU or not, to any memory element, whether system RAM or not.
In addition to <CPU, mem> distances, we also require <CPU, CPU> distances. The <CPU, CPU> distance is a measure of how costly it would be, due to caching effects, to move a task that is executing (and has considerable cache presence) on one CPU to the other CPU. These distances reflect the impact of the system caches - two processors sharing a major cache are closer. The scheduler should be more reluctant to reschedule to a CPU further away, and two tasks communicating via shared memory will want to stay on CPUs that are close to each other, in addition to being close to the shared memory. On most systems, it is probably not worth attempting to estimate how much presence a task might have in the caches of the CPU it most recently ran on. Rather, the scheduler should simply be reluctant to change the CPU on which a task is scheduled, perhaps with reluctance proportional to the <CPU, CPU> distance. For larger systems having CPUs that are fast relative to bus speed and that rely more heavily on the caches, it will become worthwhile to include an estimate of cache occupancy when deciding whether to change CPU.
The ACPI standard describes an NxN table of distances between N nodes, under the assumption that a system consists of several nodes, each node having some memory and one or a few CPUs, with all CPUs on a node equidistant from all else. Kanoj, as part of the LinuxScalabilityEffort, has proposed a PxP distance vector between any two of P processors. The above provides P distinct M-length distance vectors, one for each processor, giving the distance from that processor to each of M Memory blocks, and P distinct P-length distance vectors for each processor, giving the distance from that processor to each of the P processors. The implementation should be based on ACPI where that is available, and derive what else is needed from other, potentially architecture specific detail.
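As a minimal sketch of how such distance information might be held in user space (the array names and sizes here are illustrative assumptions only; none of this is part of the CpuMemSets API):

/*
 * Hypothetical per-processor distance vectors, as described above.
 * dist_mem[p][m] is the distance from processor p to memory block m;
 * dist_cpu[p][q] is the distance from processor p to processor q.
 * Distances are scaled so the closest <processor, memory> pair is 10.
 */
#define NR_PROCS 16     /* assumed counts, for illustration only */
#define NR_MEMS   4

static unsigned int dist_mem[NR_PROCS][NR_MEMS];    /* P vectors of length M */
static unsigned int dist_cpu[NR_PROCS][NR_PROCS];   /* P vectors of length P */

A user space service would fill these in from /proc (and, where available, ACPI), and use them when constructing cpumemmaps and cpumemsets.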
CpuMemSets provide the System Administrator substantial control over system processor and memory resources without the attendant inflexibility of hard partitions.
Similarly, all memory allocation is constrained by the cpumemmap and cpumemset associated with the kernel or vm area requesting the memory, except for specific requests within the kernel. The Linux page allocation code has been changed to search only the memory blocks allowed by the vm area requesting memory. If memory is not available in the specified memory blocks, then the allocation must fail or sleep, awaiting memory. The search for memory will not consider other memory blocks in the system. It is this "mandatory" nature of cpumemmaps and cpumemsets that makes it practical to provide many of the benefits of hard partitioning in a dynamic, single system image environment.
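A conceptual sketch of that mandatory search order follows (this is not the actual Linux allocator code; lookup_memory_list() and try_alloc_from_block() are hypothetical helpers, and the types come from the cpumemsets.h header shown later):

/*
 * Conceptual sketch only: allocate a page under a cpumemset.
 * Only the memory blocks listed for the executing CPU (or, failing
 * that, for CMS_DEFAULT_CPU) are tried, in list order; the request
 * fails rather than falling back to blocks outside the cpumemset.
 */
struct page *cms_alloc_page(cpumemset_t *cms, cms_acpu_t cpu)
{
    cms_memory_list_t *ml = lookup_memory_list(cms, cpu);      /* hypothetical */
    if (ml == NULL)
        ml = lookup_memory_list(cms, CMS_DEFAULT_CPU);          /* always present */
    for (int i = 0; i < ml->nr_mems; i++) {
        struct page *page = try_alloc_from_block(ml->mems[i]);  /* hypothetical */
        if (page != NULL)
            return page;
    }
    return NULL;   /* caller must fail or sleep; no other blocks are searched */
}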
Rather, the following scenarios provide examples of how to attach cpumemmaps to major system services.
Each cpumemset has an associated cpumemmap. When changing a cpumemmap, you select which one to change by specifying the same choices and related parameters (optional virtual address or pid) as when changing a cpumemset. After changing a single cpumemmap with a cmsSetCMM() call, that cpumemmap will no longer be shared by any other cpumemset. Only the cpumemset you went through to get to the cpumemmap will have a reference to the new, changed cpumemmap. It is an error if the changed cpumemmap doesn't supply enough CPUs or memory blocks to meet the needs of the single cpumemset using it.
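For example, a sketch (assuming the cpumemsets.h interface shown later, with most error handling omitted) of a process narrowing its own *current* cpumemmap; per the rule above, the call fails with ENOENT if the attached cpumemset still needs the application CPUs or memory blocks being dropped:

#include <stdio.h>
#include "cpumemsets.h"

/*
 * Sketch: shrink the *current* cpumemmap of this process to its first
 * two CPUs.  Once cmsSetCMM() succeeds, the changed map is private to
 * the cpumemset it was reached through; other cpumemsets that shared
 * the old map still see the old, unchanged map.
 */
int shrink_current_map(void)
{
    cpumemmap_t *cmm = cmsQueryCMM(CMS_CURRENT, 0, NULL);
    if (cmm == NULL)
        return -1;
    if (cmm->nr_cpus > 2)
        cmm->nr_cpus = 2;              /* keep only application CPUs 0 and 1 */
    if (cmsSetCMM(CMS_CURRENT, 0, NULL, 0, cmm) < 0) {
        perror("cmsSetCMM");           /* e.g. ENOENT if the cpumemset needs more */
        cmsFreeCMM(cmm);
        return -1;
    }
    cmsFreeCMM(cmm);
    return 0;
}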
Presumably, the inherited memory lists will most often be sorted to provide memory close to the faulting CPU. But it is the responsibility of the system administrator or service to determine this, not the typical application. Applications that have some specific memory access pattern for a particular address range may want to construct memory lists to control placement of that memory.
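For instance, a minimal sketch (using the interface in the header later in this document; the memory block numbers chosen here are arbitrary) of an application asking that one address range be allocated from application memory block 1 first, then block 0, regardless of which CPU takes the fault:

#include <stddef.h>
#include "cpumemsets.h"

/*
 * Sketch: place future allocations for [buf, buf+len) on application
 * memory block 1 first, falling back to block 0.  The empty top-level
 * cpu list is filled in from the caller's *current* cpumemset, and the
 * single memory list covers CMS_DEFAULT_CPU, so it applies to every CPU.
 */
int place_region(void *buf, size_t len)
{
    cms_acpu_t cpus[] = { CMS_DEFAULT_CPU };
    cms_amem_t mems[] = { 1, 0 };                 /* search order: block 1, then 0 */
    cms_memory_list_t ml = { 1, cpus, 2, mems };
    cpumemset_t cms = { CMS_DEFAULT, 0, NULL, 1, &ml };

    return cmsSetCMS(CMS_VMAREA, 0, buf, len, &cms);
}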
In other words, cpumemmaps and cpumemsets can be "named" with tuples of the form <choice, pid, address range>, where the choice is one of CMS_CURRENT, CMS_CHILD, CMS_VMAREA or CMS_KERNEL, as described in the header file below. Had there been an important need, cpumemmaps and cpumemsets could have been made a separately named, allocated and protected system resource. But this would have required additional work, a more complex API, and more software. No compelling requirement for naming CpuMemSets has been discovered, so far at least. Granted, this has been one of the more surprising aspects of this Design.
If you construct your own cpumemmap or cpumemset using some other memory layout, don't pass it to cmsFree*(). You may alter in place, and replace, malloc'd elements of a cpumemmap or cpumemset returned by a cmsQuery*() call, and pass the result back into a corresponding cmsSet*() or cmsFree*() call. You will have to explicitly free(3) any elements of the data structure that you disconnect in this fashion, to avoid a memory leak.
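As a sketch of that discipline (using the interface in the header that follows; error handling kept minimal), the fragment below queries the *current* cpumemset, swaps in a replacement memory-block array for its first memory list, applies the result, and then frees both the disconnected array and the queried structure:

#include <stdlib.h>
#include "cpumemsets.h"

/*
 * Sketch: reverse the search order of the first memory list of the
 * *current* cpumemset, following the query / alter / set / free pattern.
 * The element disconnected from the queried structure (the old mems
 * array) must be freed by hand; cmsFreeCMS() then releases the rest,
 * including the replacement array now linked into it.
 */
int reverse_first_memory_list(void)
{
    int ret = -1;
    cpumemset_t *cms = cmsQueryCMS(CMS_CURRENT, 0, NULL);
    if (cms == NULL || cms->nr_mems < 1)
        goto out;

    cms_memory_list_t *ml = &cms->mems[0];
    cms_amem_t *reversed = malloc(ml->nr_mems * sizeof(*reversed));
    if (reversed == NULL)
        goto out;
    for (int i = 0; i < ml->nr_mems; i++)
        reversed[i] = ml->mems[ml->nr_mems - 1 - i];

    cms_amem_t *old = ml->mems;    /* disconnected element: free it ourselves */
    ml->mems = reversed;

    ret = cmsSetCMS(CMS_CURRENT, 0, NULL, 0, cms);
    free(old);
out:
    if (cms != NULL)
        cmsFreeCMS(cms);
    return ret;
}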
/*
* CpuMemSets Library
* Copyright (C) 2001 Silicon Graphics, Inc.
* All rights reserved.
*
* This library is free software; you can redistribute it and/or
* modify it under the terms of the GNU Library General Public
* License as published by the Free Software Foundation; either
* version 2 of the License, or (at your option) any later version.
*
* This library is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* Library General Public License for more details.
*
* You should have received a copy of the GNU Library General Public
* License along with this library; if not, write to the
* Free Software Foundation, Inc., 59 Temple Place - Suite 330,
* Boston, MA 02111-1307 USA.
*/
/*
* cpumemsets.h - CpuMemSet application interface for managing
* system scheduling and memory allocation across
* the various CPUs and memory blocks in a system.
*/
#ifndef __CPUMEMSET_H
#define __CPUMEMSET_H
/*
* The CpuMemSet interface provides general purpose processor and
* memory placement facilities to applications, system services
* and emulations of other CPU and memory placement interfaces.
*
* It is not the objective that CpuMemSets provide the various
* placement policies and scheduling heuristics needed for
* efficient operation of SMP and NUMA systems. Nor are
* CpuMemSets intended to replace existing API's that have been
* developed to provide such solutions, on Linux and other
* vendor systems. Rather it is the objective of CpuMemSets
* that they provide a common Linux kernel mechanism suitable
* to support the implementation of various such solutions and
* provide emulations of existing API's, with minimal impact
* on existing (or future) kernel scheduling and allocator code.
*
* CpuMemSets were born of the following realizations:
*
* 1) The kernel should support scheduling choice and memory
* placement mechanisms sufficiently generic and policy
* neutral to support a variety of solutions and policies,
* without having to make constant kernel changes.
*
* 2) There are too many existing and anticipated solutions to
* these static scheduling and placement problems to all fit
* in the kernel, so one kernel mechanism is needed to support
* them all.
*
* 3) The ongoing rate of evolution of the more dynamic aspects
* of scheduling and allocation mandate that the relatively
* static aspects addressed by CpuMemSets be kept "off to the
* side" as much as possible, with very minimal impact on the
* existing, or future, scheduling and allocation code.
*
* 4) In the long run, it is untenable to have separate scheduling
* or allocation kernel code (as in the current mm/numa.c)
* for large systems (multiple memory blocks or NUMA), as
* opposed to single memory systems. Maintenance costs go
* up, bugs are introduced, fixes to one are missed in the
* other, and semantics diverge gratuitously. Rather one
* body of kernel code is required, which is optimal (both
* readability and performance) for normal systems but still
* entirely suitable for large systems.
*
* CpuMemSets are implemented using two somewhat separate layers.
*
* 1) cpumemmap (cmm):
*
* The bottom layer provides a simple pair of maps, mapping
* system CPU and memory block numbers to application CPU
* and memory block numbers. System numbers are those used
* by the kernel task scheduling and memory allocation code,
* and typically include all CPU and memory in the system.
* Application numbers are those used by an application in
* its cpumemset to specify its CPU and memory affinity
* for those CPU and memory blocks available in its map.
* Each process, each virtual memory area, and the kernel has
* such a map. These maps are inherited across fork, exec
* and the various ways to create vm areas. Only a process
* with root privileges can extend cpumemmaps to include
* additional system CPUs or memory blocks. Changing a map
* will cause kernel scheduling code to immediately start
* using the new system CPUs, and cause kernel allocation
* code to allocate additional memory pages using the new
* system memory blocks, but memory already allocated on old
* blocks will not be migrated, unless by some other means.
*
* The bulk of the kernel is still written using whatever
* system CPU and memory block numbers are appropriate for
* a system. Changes to cpumemmaps are converted at the time
* of the cmsSet*() calls into changes to the system masks
* (such as cpus_allowed) and lists (such as zone lists)
* used by the existing scheduler and allocator.
*
* 2) cpumemset (cms):
*
* The upper layer specifies on which of the application
* CPUs known to that process a task can be scheduled, and
* in which application memory blocks known to the kernel
* or that vm area, memory can be allocated. The kernel
* allocators search the memory block lists in the given
* order for available memory, and a different list is
* specified for each CPU that may execute the request.
* An application may change the cpumemset of its tasks
* and vm areas, and root may change the cpumemset used
* for kernel memory allocation. Also root may change the
* cpumemsets of any process, and any process may change the
* cpumemsets of other processes with the same uid (kill(2)
* permissions).
*
*
* Each task has two cpumemsets, one defining its *current* CPU
* allocation and created vm areas, and one that is inherited by
* any *child* process it forks. Both the *current* and *child*
* cpumemsets of a newly forked process are set to copies of
* the *child* cpumemset of the parent process. Allocations of
* memory to existing vm areas visible to a process depend on
* the cpumemset of that vm area (as acquired from its creating
* process at creation, and possibly modified since), not on
* the cpumemset of the currently accessing task.
*
* During system boot, the kernel creates and attaches a
* default cpumemmap and cpumemset that is used everywhere.
* By default this initial map and set contain all CPUs and
* all memory blocks. The memory blocks are not necessarily
* sorted in any particular order, though provision is made for
* an architecture specific hook to code that can rearrange
* this initial cpumemset and cpumemmap. An optional kernel
* boot command line parameter causes this initial cpumemmap
* and cpumemset to contain only the first CPU and one memory
* block, rather than all of them, for the convenience of system
* management services that wish to take greater control of
* the system.
*
* The kernel will only schedule a task on the CPUs in the task's
* cpumemset, and only allocate memory to a user virtual memory
* area from the list of memories in that area's memory list.
* The kernel allocates kernel memory only from the list of
* memories in the cpumemset attached to the CPU executing the
* allocation request, except for specific calls within the kernel
* that specify some other CPU or memory block.
*
* Both the *current* and *child* cpumemmaps and cpumemsets of
* a newly forked process are taken from the *child* settings
* of its parent, and memory allocated during the creation of
* the new process is allocated according to the parent's *child*
* cpumemset and associated cpumemmap, because that cpumemset is
* acquired by the new process and then by any vm area created
* by that process.
*
* The cpumemset (and associated cpumemmap) of a newly created
* virtual memory area is taken from the *current* cpumemset
* of the task creating it. In the case of attaching to an
* existing vm area, things get more complicated. Both mmap'd
* memory objects and System V shared memory regions can be
* attached to by multiple processes, or even attached to
* multiple times by the same process at different addresses.
* If such an existing memory region is attached to, then by
* default the new vm area describing that attachment inherits
* the *current* cpumemset of the attaching process. If however
* the policy flag CMS_SHARE is set in the cpumemset currently
* linked to from each vm area for that region, then the new
* vm area will also be linked to this same cpumemset.
*
* When allocating another page to an area, the kernel will
* choose the memory list for the CPU on which the current
* task is being executed, if that CPU is in the cpumemset of
* that memory area, else it will choose the memory list for
* the default CPU (see CMS_DEFAULT_CPU) in that memory area's
* cpumemset. The kernel then searches the chosen memory list
* in order, from the beginning of that memory list, looking
* for available memory. Typical kernel allocators search the
* same list multiple times, with increasingly aggressive search
* criteria and memory freeing actions.
*
* The cpumemmap and cpumemset calls made with CMS_VMAREA apply
* to all future allocation of memory by any existing vm area,
* for any pages overlapping any addresses in the range [start,
* start+len), similar to the behavior of madvise, mincore
* and msync.
*
* Interesting Error Cases:
*
* If a request is made to set a cpumemmap that has fewer CPUs
* or memory blocks listed than needed by any cpumemsets that
* will be using that cpumemmap after the change, then that
* cmsSetCMM() will fail, with errno set to ENOENT. That is,
* you cannot remove elements of a cpumemmap that are in use.
*
* If a request is made to set a cpumemset that references CPU
* or memory blocks not available in its current cpumemmap,
* then that cmsSetCMS() will fail, with errno set to ENOENT.
* That is, you cannot reference unmapped application CPUs
* or memory blocks in a cpumemset.
*
* If a request is made to set a cpumemmap by a process
* without root privileges, and that request attempts to
* add any system CPU or memory block number not currently
* in the map being changed, then that request will fail,
* with errno set to EPERM.
*
* If a cmsSetCMS() request is made on another
* process, then the requesting process must either have
* root privileges, or the real or effective user ID of
* the sending process must equal the real or saved
* set-user-ID of the other process, or else the request
* will fail, with errno set to EPERM. These permissions
* are similar to those required by the kill(2) system call.
*
* Every cpumemset must specify a memory list for the
* CMS_DEFAULT_CPU, to ensure that regardless of which CPU
* a memory request is executed on, a memory list will
* be available to search for memory. Attempts to set
* a cpumemset without a memory list specified for the
* CMS_DEFAULT_CPU will fail, with errno set to EINVAL.
*
* If a request is made to set a cpumemset that has the same
* CPU (application number) listed in more than one array
* "cpus" of CPUs sharing any cms_memory_list_t, then the
* request will fail, with errno set to EINVAL. Otherwise,
* duplicate CPU or memory block numbers are harmless, except
* for minor inefficiencies.
*
* The operations to query and set cpumemmaps and cpumemsets
* can be applied to any process (any pid). If the pid is
* zero, then the operation is applied to the current process.
* If the specified pid does not exist, then the operation
* will fail with errno set to ESRCH.
*
* Not all portions of a cpumemset are useful in all cases.
* For example the CPU portion of a vm area cpumemset is unused.
* It is not clear as of this writing whether CPU portions of the
* kernel's cpumemset are useful. When setting a CMS_KERNEL or
* CMS_VMAREA cpumemset, it is acceptable to pass in a cpumemset
* structure with an empty cpu list (nr_cpus == 0 and *cpus ==
* NULL), and such an empty cpu list will be taken as equivalent
* to passing in the cpu list from the *current* cpumemset of
* the requesting process.
*
* A /proc interface should be provided to display the cpumemset
* and cpumemmap structures, settings and connection to tasks,
* vm areas, the kernel, and system and application CPUs
* and memory blocks. This /proc interface is to be used by
* system utilities that report on system activity and settings.
* The CpuMemSet interface described in this file is independent
* of that /proc reporting interface.
*
* None of this CpuMemSet apparatus has knowledge of distances
* between nodes or memory blocks in a NUMA system. Presumably
* other mechanisms exist on such large machines to report
* to system services and tools in user space the topology
* and distances of the system processor, memory and I/O
* architecture, thus enabling such user space services to
* construct cpumemmaps and cpumemsets with the desired structure.
*
* System services and utilities that query and modify cpumemmaps
* identify maps by one of:
* CMS_CURRENT - specifying a process id, for the *current*
* map attached to that process
* CMS_CHILD - specifying a process id, for the *child*
* map attached to that process
* CMS_VMAREA - specifying a process id and virtual address
* range [start, start+len), for the map attached
* to the pages in that address range of that process
* CMS_KERNEL - for the kernel (pid, start and len args not used)
*
* System services and utilities that query and modify cpumemsets
* identify sets by one of:
* CMS_CURRENT - specifying a process id, for the *current*
* set attached to that process
* CMS_CHILD - specifying a process id, for the *child*
* set attached to that process
* CMS_VMAREA - specifying a process id and virtual address
* range [start, start+len), for the set attached
* to the pages in that address range of that process
* CMS_KERNEL - for the kernel (pid, start and len args not used)
*
* This API is not directly implemented by dedicated system
* calls, but rather by adding options to a lower level general
* purpose system call. That low level API (currently using
* prctl) should not be used by applications, and is subject
* to change. Rather use this CpuMemSet API, which should
* be stable over time. To the extent consistent with the
* evolution of Linux and as resources permit, changes to this
* API will preserve forward and backward, source and binary
* compatibility for both kernel and application.
*
* The cpumemmaps and cpumemsets returned by the cmsQuery*()
* routines are constructed using a malloc() for each separate
* structure and array, and should, when no longer needed, be
* freed with a cmsFreeCMM() or cmsFreeCMS() call, to free()
* that memory.
*/
#if defined(sgi)
typedef unsigned short int uint16_t;
typedef int pid_t;
typedef unsigned int size_t;
#else
#include <stddef.h>	/* for size_t */
#include <sys/types.h>	/* for pid_t */
#include <stdint.h>	/* for uint16_t */
#endif
#define CMS_DEFAULT 0x01 /* Memory list order (first-touch, typically) */
#define CMS_SHARE 0x04 /* Inherit virtual memory area CMS, not task */
typedef int cms_setpol_t; /* Type of policy argument for sets */
/* 16 bits gets us 64K CPUs ... no one will ever need more than that! */
typedef uint16_t cms_acpu_t; /* Type of application CPU number */
typedef uint16_t cms_amem_t; /* Type of application memory block number */
typedef uint16_t cms_scpu_t; /* Type of system CPU number */
typedef uint16_t cms_smem_t; /* Type of system memory block number */
#define CMS_DEFAULT_CPU ((cms_acpu_t)-1) /* Marks default Memory List */
/* Calls to query and set cmm and cms need to specify which one ... */
#define CMS_CURRENT 0 /* cmm or *current* cms of this process */
#define CMS_CHILD 1 /* *child* cms of this process */
#define CMS_VMAREA 2 /* cmm or cms of vmarea at given virtual addr */
#define CMS_KERNEL 3 /* cmm or cms of kernel (root-only) */
typedef int cms_choice_t; /* Type of cmm/cms choice argument */
/* cpumemmap: Type for the pair of maps ... */
typedef struct cpumemmap {
int nr_cpus; /* number of CPUs in map */
cms_scpu_t *cpus; /* array maps application to system CPU num */
int nr_mems; /* number of mems in map */
cms_smem_t *mems; /* array maps application to system mem num */
} cpumemmap_t;
/*
* How memory looks to (typically) a set of equivalent CPUs,
* including which memory blocks to search for memory, in what order,
* and the list of CPUs to which this list of memory blocks applies.
* The cpumemset is sufficiently complex that this portion of the
* data structure type is specified separately, then an array of
* cms_memory_list_t structures is included in the main cpumemset type.
*/
typedef struct cms_memory_list {
int nr_cpus; /* Number of CPUs sharing this memory list */
cms_acpu_t *cpus; /* Array of CPUs sharing this memory list */
int nr_mems; /* Number of memory blocks in this list */
cms_amem_t *mems; /* Array of 'nr_mems' memory blocks */
} cms_memory_list_t;
/*
* Specify a single cpumemset, describing on which CPUs to
* schedule tasks, from which memory blocks to allocate memory,
* and in what order to search these memory blocks.
*/
typedef struct cpumemset {
cms_setpol_t policy; /* or'd CMS_* set policy flags */
int nr_cpus; /* Number of CPUs in this cpumemset */
cms_acpu_t *cpus; /* Array of 'nr_cpus' processor numbers */
int nr_mems; /* Number of Memory Lists in this cpumemset */
cms_memory_list_t *mems;/* Array of 'nr_mems' Memory Lists */
} cpumemset_t;
/* Manage cpumemmaps (need perms like kill(2), must be root to grow map) */
cpumemmap_t *cmsQueryCMM (cms_choice_t c, pid_t pid, void *start);
int cmsSetCMM (
cms_choice_t c, pid_t pid, void *start, size_t len, cpumemmap_t *cmm);
/* Manage cpumemsets (need perms like kill(2), must be root to grow map) */
cpumemset_t *cmsQueryCMS (cms_choice_t c, pid_t pid, void *start);
int cmsSetCMS (
cms_choice_t c, pid_t pid, void *start, size_t len, cpumemset_t *cms);
/* Return application CPU number currently executing on */
cms_acpu_t cmsGetCPU(void);
/* Free results from above cmsQuery*() calls */
void cmsFreeCMM (cpumemmap_t *cmm);
void cmsFreeCMS (cpumemset_t *cms);
#endif
This concludes the display of the C language cpumemsets.h header file and the portion of this document subject to GPL licensing.
Example Showing Interaction of cpumemmaps and cpumemsets
Given the following hardware configuration:
Let's say we have a four node system, with four CPUs
per node, and one memory block per node, named as follows:
Name the 16 CPUs: c0, c1, ..., c15 # 'c' for CPU
and number them: 0, 1, 2, ..., 15 # cms_scpu_t
Name the 4 memories: mb0, mb1, mb2, mb3 # 'mb' for memory block
and number them: 0, 1, 2, 3 # cms_smem_t
cpumemmap:
Now let's say the administrator (root) chooses to set up a
Map containing just the 2nd and 3rd node (CPUs and memory
thereon). The cpumemmap for this would contain:
{
8, # nr_cpus (length of CPUs array)
p1, # CPUs (ptr to array of cms_scpu_t)
2, # nr_mems (length of mems array)
p2 # mems (ptr to array of cms_smem_t)
}
where p1, p2 point to arrays of system CPU + mem numbers:
p1 = [ 4,5,6,7,8,9,10,11 ] # CPUs (array of cms_scpu_t)
p2 = [ 1,2 ] # mems (array of cms_smem_t)
This map shows, for example, that for this Map,
application CPU 0 corresponds to system CPU 4 (c4).
cpumemset:
Further, let's say that an application running within this
map chooses to restrict itself to just the odd-numbered
CPUs, and to search memory in the common "first-touch"
manner (local node first). It would establish a
cpumemset containing:
{
CMS_DEFAULT, # policy (or'd CMS_* set policy flags)
4, # nr_cpus (length of CPUs array)
q1, # CPUs (ptr to array of cms_acpu_t)
2, # nr_mems (length of mems array)
q2, # mems (ptr to array of cms_memory_list_t)
}
where q1 points to an array of 4 application CPU numbers
and q2 to an array of 2 memory lists:
q1 = [ 1,3,5,7 ], # CPUs (array of cms_acpu_t)
q2 = [ # See "Verbalization example" below
{ 3, r1, 2, s1 }
{ 2, r2, 2, s2 }
]
where r1, r2 are arrays of application CPUs:
r1 = [1, 3, CMS_DEFAULT_CPU]
r2 = [5, 7]
and s1, s2 are arrays of memory blocks:
s1 = [0, 1]
s2 = [1, 0]
Verbalization example:
If a fault occurs on any of the 2 CPUs in r2, search the 2 memory blocks in s2 in order (mb2, then mb1). If a fault occurs on any other CPU, then since the CMS_DEFAULT_CPU value is listed in r1, search the 2 memory blocks in s1 in order (mb1, then mb2).
The meaning of "s2 = [1, 0]" is that if a page fault occurs on the application CPUs listed in "r2 = [5, 7]", then the same memory blocks are searched, but in the other order, mb2 then mb1. In particular, if a vm area using the above cpumemset was also shared with an application running on some other Map, and that application faulted while running on some CPU not explicitly listed in the above cpumemset (item r1 or r2), then the allocator would search mb1 first, then mb2, for available memory. This is because CMS_DEFAULT_CPU is listed amongst the CPUs in r1, and the corresponding s1 is equivalent to the ordered array of physical memory blocks [mb1, mb2].
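The same Map and Set could be written down in C using the header shown earlier. This is a sketch only: it combines the administrator's step (which needs root privileges, since the map adds system CPUs and memory blocks) with the application's step in one function for brevity, and it sets the cpumemset first so that the narrowed map still covers everything the set references:

#include "cpumemsets.h"

/*
 * The cpumemmap from the example: system CPUs 4..11 (c4..c11) and
 * system memory blocks 1..2 (mb1, mb2), seen by the application as
 * application CPUs 0..7 and application memory blocks 0..1.
 */
static cms_scpu_t p1[] = { 4, 5, 6, 7, 8, 9, 10, 11 };
static cms_smem_t p2[] = { 1, 2 };
static cpumemmap_t example_map = { 8, p1, 2, p2 };

/*
 * The cpumemset from the example: odd application CPUs only, with
 * the two memory lists (r1, s1) and (r2, s2) described above.
 */
static cms_acpu_t q1[] = { 1, 3, 5, 7 };
static cms_acpu_t r1[] = { 1, 3, CMS_DEFAULT_CPU };
static cms_acpu_t r2[] = { 5, 7 };
static cms_amem_t s1[] = { 0, 1 };
static cms_amem_t s2[] = { 1, 0 };
static cms_memory_list_t q2[] = {
    { 3, r1, 2, s1 },
    { 2, r2, 2, s2 },
};
static cpumemset_t example_set = { CMS_DEFAULT, 4, q1, 2, q2 };

/* Install both for the current process (error handling omitted). */
void install_example(void)
{
    cmsSetCMS(CMS_CURRENT, 0, NULL, 0, &example_set);   /* set first ... */
    cmsSetCMM(CMS_CURRENT, 0, NULL, 0, &example_map);   /* ... then narrow the map */
}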
October 8, 2001 Revision