Developer Beta Release
November 13, 2005
Open|SpeedShop was designed to be modular and extensible. It supports the concept of plugins which allow users, if they so desire, to create their own performance experiments. The project is designed in such a way as to enable value-added plugins to be added to the open source version.
Another key feature of the performance tool is its usability. The user interface is designed for scientists, not computer scientists. To make the tool usable by a wider range of users, the performance tool will provide wizards for novice users, easily understood language in the user interface components, and context-sensitive help.
SGI will release, primarily under the GPL and LGPL Open Source licenses, the Open|SpeedShop baseline functionality. Support for single system image (SSI) machines, support for clusters (i.e., multiple OS kernels), exclusive and inclusive user time, program counter (PC) sampling, MPI call tracing, Input/Output tracing, Floating point exception tracing and CPU hardware performance counter experiments are the key components of the baseline functionality. The performance tool will be designed in such a manner that will allow users to easily extend the tool by adding their own experiments.
The performance tool will use the Dynamic Probe Class Library (DPCL) as a building block for cluster system support. Using DPCL gives the performance tool a portable means of supplying cluster support on platforms supported by the Dyninst dynamic instrumentation package. DPCL uses the Dyninst application programming interface (API) to provide its instrumentation capability.
Availability of the new performance tool based on SpeedShop technology on Linux platforms will enable Fortran (77, 90, and 95), C, and C++ programmers to use an advanced performance analysis tool within the Open Source environment.
Open|SpeedShop is oriented towards gathering performance data from an application, displaying it, and relating it back to the application's source file, function, and/or line number(s).
Technical terms can have multiple or context-sensitive meanings. This section therefore explains and clarifies the terms used in this document.
The Summary of Experiments table below shows the possible
experiments you can perform using the Open|SpeedShop tools and the
reasons why you might want to choose a specific
experiment. The Clues column shows when you might use an experiment.
The Data Collected column indicates performance data collected by the
experiment. For detailed information on the experiments, see the
relevant section in the remainder of this chapter.
Table: Summary of Experiments

Floating-point exception tracing
    Clues: High system time. Presence of floating-point operations.

Hardware counter (hwc)
    Clues: High user CPU time.
    Data Collected: Counts at the source line, machine instruction, and function levels of various hardware events, including: clock cycles, graduated instructions, primary instruction cache misses, secondary instruction cache misses, primary data cache misses, secondary data cache misses, translation lookaside buffer (TLB) misses, and graduated floating-point instructions. PC sampling is used.

Hardware counter with callstacks (hwctime)
    Clues: High user CPU time.
    Data Collected: Similar to the hwc experiment, except that callstack sampling is used.

I/O tracing
    Data Collected: Traces the following I/O system calls: read, readv, write, writev, open, close, dup, pipe, creat.

MPI call tracing
    Clues: MPI performance is poor.
    Data Collected: Traces and times calls to various MPI routines.

PC sampling (pcsamp)
    Clues: High user CPU time.
    Data Collected: Actual CPU time at the source line, machine instruction, and function levels by sampling the program counter at 10- or 1-millisecond intervals.

User time (usertime)
    Clues: Slow program, nothing else known. Not CPU-bound.
    Data Collected: Inclusive and exclusive CPU time for each function by sampling the callstack at 30-millisecond intervals.
The Open|SpeedShop performance tool is designed in such a way as to enable value-added plugins for the open source version. These value-added plugins can be the pre-canned performance analysis experiments, which contain the experiment specific collector and the corresponding command line interface and GUI support software. There are separate plugins for the collector, command line interface, and GUI software components. Plugins can also be written for advanced/enhanced versions of experiment collectors, command line interface and GUI analysis tools. These plugins allow the Open|SpeedShop performance tool to be enhanced by the open source community. Open|SpeedShop supports clustering. The open source component, DPCL, is used by Open|SpeedShop to provide the clustering capability. The use of the Dyninst component allows Open|SpeedShop to insert instrumentation and also to attach to running processes. See the Plugin API (Extensibility) Description section for a detailed description.
The following sections describe how to build and install Open|SpeedShop. The RPM packages that are needed to build Open|SpeedShop and the RPM packages that are needed to run Open|SpeedShop are defined below.
The rpm command:
rpm --install --nodeps openspeedshop-1.0-1.ia64.rpm
will install Open|SpeedShop executables into /usr/local/bin and the Open|SpeedShop plugins into /usr/local/lib.
The following table defines the RPMs that need to be installed on your system to run Open|SpeedShop.
Component Name   Minimum       Maximum       Notes
dpcl             see-note-1    see-note-1    (1)
dyninst          see-note-1    see-note-1    (1)
libdwarf         2004/02/03    2005/08/01    none
libelf           0.8.5         0.8.5         (2)
libunwind        0.98.5        0.98.5        none
papi
python           2.3.3         2.3.4         none
sqlite           3.0.8         3.2.7         none
qt               3.3.1         3.3.4         none
The tools that form the basis of Open|SpeedShop are the graphical user interface (GUI), the interactive command line interface (CLI), and the batch command. These are the three Open|SpeedShop tools that have user interfaces.
The user invokes Open|SpeedShop by typing the "openss" command. The command has three modes of operation:
- openss -gui: This invocation raises the GUI and also creates a command panel, which becomes the interactive CLI window. Interactive commands entered into the GUI's command panel have the same effect as if they were entered under the "openss -cli" option.
- openss -cli: This invocation makes the terminal window the interactive CLI window.
- openss -batch: This invocation causes Open|SpeedShop to execute a performance experiment, specified by additional arguments, directly without user interaction. The -batch operation can be used in scripts and batch processing environments.
Open|SpeedShop can be invoked in a number of ways. This section gives a few examples to help first-time users get started and to introduce the following sections on Open|SpeedShop tool usage.
As stated above in the "openss" command section, the "openss" command starts the Open|SpeedShop performance tool. The arguments that select the type of execution are "-batch", "-cli", and "-gui". The default is GUI, so "openss" with no parameters invokes the Open|SpeedShop GUI tool. The following examples show a few simple, typical usage scenarios.
prompt:> openss -f <pathTOexecutable>/executable_name pcsamp
The above command will bring up the GUI window with the executable loaded to run the PC Sampling experiment on localhost. All the user would need to do is push the "Run" button on the GUI and wait for the results to be displayed. The path to the executable can be a relative path.
prompt:> openss -cli
expcreate -h bigiron.sgi.com -f <pathTOexecutable>/executable_name usertime
The openss command above will invoke the command line interface in its initial state. The user would then type the experiment create command (expcreate) and specify the host, executable file, and experiment name. The command to run the experiment (expgo) will start the user application and the performance experiment. Currently the Open|SpeedShop command line interface is asynchronous: the user gets a prompt back immediately after issuing the expgo command. This allows the user to issue other commands and/or monitor performance data before a long-running job completes. If performance experiment data is available, the performance result viewing command (expview) can be used to view the data on demand. Open|SpeedShop will return the data that it currently knows about.
Create a file (batch.input) containing the command line commands:
expcreate -h bigiron.sgi.com -f <pathTOexecutable>/executable_name pcsamp
Then execute the Open|SpeedShop batch tool by using the "-batch" argument on the "openss" command:
openss -batch < batch.input
The Open|SpeedShop PC Sampling experiment will be run and its output will be written to stdout.
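The batch workflow above can be scripted. The following is a minimal sketch, not part of Open|SpeedShop, that generates the text of a batch input file. The commands expcreate, expgo, and expview stats5 all appear in this guide, but the assumption that a batch file lists them one per line in this order is ours, and the helper name build_batch_input is hypothetical.

```python
def build_batch_input(executable, experiment="pcsamp", top_n=5):
    """Return the text of a batch input file (hypothetical helper).

    Assumption: the batch file holds one CLI command per line, in the
    order create, run, view. Only the command names come from this guide.
    """
    commands = [
        f"expcreate -f {executable} {experiment}",  # create the experiment
        "expgo",                                    # run it
        f"expview stats{top_n}",                    # show the top N functions
    ]
    return "\n".join(commands) + "\n"


if __name__ == "__main__":
    text = build_batch_input("/home/openss/demo/Simple/mutatee")
    print(text, end="")
    # Save the text as batch.input, then run: openss -batch < batch.input
```

Generating the file from a function makes it easy to run the same experiment over several executables from a shell script or job scheduler.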
To run Open|SpeedShop on a multiple node machine configuration, each node must have DPCL and Dyninst installed. Open|SpeedShop support for multiple nodes is through DPCL daemons. See the attaching to MPI jobs section for information on attaching to processes running on other hosts or partitions.
The Open|SpeedShop GUI contains a main window from which users can choose a wizard. The wizard asks a series of questions and uses the answers to help select the proper experiment. The GUI also contains a source view panel, a statistics panel, and a command panel.
The GUI is bundled into a dynamic library that is loaded on demand. It is the Command Line Interface (CLI) that launches the GUI. By default the CLI will launch the GUI upon invocation of the Open|SpeedShop tool. However, the CLI can be started without starting the GUI ($ openss -cli), and the GUI can then be loaded and initialized when needed via the CLI "openGui" command.
Upon invoking Open|SpeedShop ($ openss) the command line is parsed, and if the GUI is requested, the GUI library is loaded and launched. Open|SpeedShop then drops into two event loops, one for parsing command line events and the other for parsing GUI events.
When the GUI is loaded, the GUI looks for GUI plugins in the default directory and in the OPENSS_PLUGIN_DIR environment variable path. Each file in the directory is opened and an internal entry point is queried. If found, the plugin manager calls the entry point, initializes any exported menus, brings up the GUI, and then drops into the main event loop waiting for user interaction.
Experiment data can be saved with the Save Experiment Data menu item. Performance data contained in the experiment may be reexamined and redisplayed once the experiment is opened again with the Open Saved Experiment menu item.
As of this Developer Beta delivery, the PC Sampling, Usertime, and HW Counter experiments have been developed. Clicking on PC Sampling will bring up the program counter (PC) Sampling experiment panel as shown in the PC Sampling Experiment section.
In the Open|SpeedShop hardware counter experiments, overflows of a particular hardware counter are recorded. Each hardware counter is configured to count from zero to a number designated as the overflow value. When the counter reaches the overflow value, the system resets it to zero and increments the number of overflows at the present program instruction address. Each experiment provides two possible overflow values; the values are prime numbers, so any profiles that seem the same for both overflow values should be statistically valid.
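The overflow mechanism described above can be illustrated with a small simulation. This is a sketch of the bookkeeping only, not of how Open|SpeedShop or the hardware implements it. The function name profile_overflows and the trace format (pairs of instruction address and number of events raised there) are assumptions made for the example.

```python
from collections import Counter


def profile_overflows(event_trace, overflow_value):
    """Simulate overflow profiling of one hardware counter.

    event_trace: iterable of (pc_address, event_count) pairs, in
    execution order (a hypothetical input format).
    Returns a Counter mapping pc_address -> number of overflows.
    """
    counter = 0
    overflows = Counter()
    for pc, events in event_trace:
        for _ in range(events):
            counter += 1
            if counter == overflow_value:
                counter = 0            # the system resets the counter to zero
                overflows[pc] += 1     # and charges the overflow to this address
    return overflows


if __name__ == "__main__":
    trace = [(0x400, 7), (0x404, 3), (0x400, 10)]
    for pc, n in sorted(profile_overflows(trace, overflow_value=5).items()):
        # overflows * overflow_value estimates the events near this address
        print(hex(pc), n, n * 5)
```

Multiplying the overflow count at an address by the overflow value gives an estimate of the number of events attributed to that address.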
The experiments described in this section are available for systems that have hardware counters. Hardware counters allow you to count various types of events, such as cache misses and counts of issued and graduated instructions.
A hardware counter works as follows: for each event, the appropriate hardware counter is incremented on the processor clock cycle. For example, when a floating-point instruction is graduated in a cycle, the graduated floating-point instruction counter is incremented by 1.
These experiments are detailed by nature. They return information gathered at the hardware level. You probably want to run a higher level experiment first. Once you have narrowed the scope, you can use hardware counter experiments to pinpoint the area to be tuned.
The following sections describe hardware counter experiments available in Open|SpeedShop.
Hardware counter overflow profiling experiments should incur a slowdown of execution of the program of no more than 5%.
The hwctime hardware counter experiments also show where the overflows are being triggered in the program. These experiments are similar to the hwc experiments, but record the callstack information rather than showing where the program counter was when the overflow occurred.
The pcsamp experiment estimates the actual CPU time for each source code line, machine code line, and function in your program. The Command Line Interface performance results listing and the GUI performance results panel of this experiment show both inclusive and exclusive PC sampling time. This experiment is a lightweight, high-speed operation that makes use of the operating system.
CPU time is calculated by multiplying the number of times an instruction or function appears in the PC by the interval specified for the experiment (for example: 1 or 10 milliseconds). To collect the data, the operating system regularly stops the process, increments a counter corresponding to the current value of the PC, and resumes the process. The default sample interval is 10 milliseconds.
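The calculation above amounts to counting samples and scaling by the interval. A minimal sketch follows. The function name cpu_time_from_samples and the use of function names as sample values are assumptions for illustration; the real experiment records program counter addresses and resolves them to functions and source lines later.

```python
from collections import Counter


def cpu_time_from_samples(samples, interval_ms=10.0):
    """Estimate CPU time in seconds per sampled location.

    samples: one entry per PC sample (here, the name of the function
    the PC fell in; a simplification of real addresses).
    interval_ms: sampling interval; 10 ms is the default stated above.
    """
    counts = Counter(samples)
    return {where: n * interval_ms / 1000.0 for where, n in counts.items()}


if __name__ == "__main__":
    samples = ["f3"] * 6 + ["f2"] * 3 + ["f1"]
    for name, seconds in sorted(cpu_time_from_samples(samples).items()):
        print(f"{seconds:.3f}  {name}")
```

With a 10-millisecond interval, six samples in a function correspond to an estimated 60 milliseconds of CPU time.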
PC sampling runs should slow the execution time of the program down no more than 5 percent. The measurements are statistical in nature, meaning they exhibit variance inversely proportional to the running time.
The usertime experiment is a useful experiment to start your performance analysis. The usertime experiment returns CPU time for each function while your program runs.
This experiment uses statistical call stack profiling to measure inclusive and exclusive user time. Data is gathered by sampling the callstack every 30 milliseconds. The program's callstack data is used to do the following:
- Attribute exclusive user time to the function at the bottom of each callstack (that is, the function being executed at the time of the sample).
- Attribute inclusive user time to all the functions above the one currently being executed (those involved in the chain of calls that led to the function at the bottom of the callstack executing).
The time spent in a procedure is determined by multiplying the number of times an instruction for that procedure appears in the stack by the sampling time interval between call stack samples. Call stacks are gathered when the program is running; hence, the time computed represents user time, not time spent when the program is waiting for a CPU. User time shows both the time the program itself is executing and the time the operating system is performing services for the program, such as I/O.
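The attribution rules above can be sketched in a few lines. This is an illustration under stated assumptions, not the Open|SpeedShop implementation. Each sample is a list of function names ordered from the outermost caller to the function being executed, and the sketch follows the common convention that a function's inclusive time also covers its own exclusive time. The name attribute_user_time is hypothetical.

```python
from collections import Counter


def attribute_user_time(callstacks, interval_ms=30.0):
    """Return (exclusive, inclusive) user time in seconds per function.

    callstacks: list of samples; each sample lists function names from
    the outermost caller to the currently executing function (last).
    """
    exclusive = Counter()
    inclusive = Counter()
    for stack in callstacks:
        if not stack:
            continue
        exclusive[stack[-1]] += 1      # function executing at sample time
        for fn in set(stack):          # count recursive functions once per sample
            inclusive[fn] += 1
    seconds = interval_ms / 1000.0
    return (
        {fn: n * seconds for fn, n in exclusive.items()},
        {fn: n * seconds for fn, n in inclusive.items()},
    )


if __name__ == "__main__":
    stacks = [
        ["main", "f1", "f3"],
        ["main", "f1"],
        ["main", "f2", "f3"],
        ["main", "f1", "f3"],
    ]
    excl, incl = attribute_user_time(stacks)
    for fn in sorted(incl):
        print(f"{fn}: exclusive={excl.get(fn, 0.0):.2f}s inclusive={incl[fn]:.2f}s")
```

Counting each function at most once per sample keeps recursive calls from inflating inclusive time beyond the total sampled time.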
The usertime experiment should incur a program execution slowdown of no more than 15%. Data from a usertime experiment is statistical in nature and shows some variance from run to run.
Typical Open|SpeedShop GUI Wizard Usage
Currently there are three areas of preference processing:
The General preferences are, as the general title implies, for setting items that apply to the overall Open|SpeedShop tool. Items such as font characteristics, graphics, splash screen, and remote shell command processing are now supported.
The statistics-related preferences apply to the presentation of performance experiment results. The options currently supported are how to sort the results, which column to sort by, and how many result items to display.
The source panel preferences are show line numbers and statistics. When set, the statistics preference presents the performance results integrated with the source.
See the CLI command section for more information.
- Choosing experiment result metrics
- Choosing process, thread, or rank to view
- Comparing results
- Exporting/Saving result data
After the GUI is launched, either with openss -gui or by default with the plain openss command (no -cli, -batch, or -gui options), the initial window looks like this:
Given this window the user can answer the wizard questions and proceed by clicking on the Next button on the lower right hand side of the Open|SpeedShop GUI window. In this example the user has chosen the default option which is to find out where the time is spent in the user's yet to be defined application. When the user clicks on the Next button this is the window that appears.
The above panel/window is the introduction panel to the PC (Program Counter) Sampling Experiment Wizard. This panel explains what the Open|SpeedShop PC sampling experiment does. The experiment takes periodic samples of the machine's program counter and stores them. Later, during analysis, the Open|SpeedShop tool associates the program counter addresses with the user's application and reports which functions and/or source lines were executed during the application's execution. The user can now click on Next to proceed with the wizard process, go back, or quit the wizard by clicking on the Finish button. Here is the next window in the wizard process, assuming the user clicked on the Next button.
The above panel/window is the parameter selection panel for the PC (Program Counter) Experiment Wizard. This panel allows the user to set the rate at which the PC sampling experiment samples the program counter and saves that address as the experiment measurement data. The user can now click on Next to proceed with the wizard process, go back, or quit the wizard by clicking on the Finish button. The user may also click on the Reset button, which resets the parameter to its default value. Here is the next window in the wizard process, assuming the user clicked on the Next button.
The panel/window above allows the user to select the executables or attach to a set of running processes. By selecting the "Load an Executable from Disk" item the user will cause a selection window to appear. Using the selection window the user can click on executables to be the application that the PC sampling experiment will gather data for.
The panel above shows the executable selection window which allows the user to select the executable they would like to load and subsequently have performance analysis done on.
An alternative to loading the executable is to attach to a running process. For example, had the executable fred already been running, the above dialog display would allow the user to select the running process for fred, another process, or multiple processes. See the attaching to an already running MPI job section, which discusses attaching to an entire MPI job using this dialog display.
The panel/window above summarizes the user's choices and tells the user to click on the Finish button to complete the process of creating the PC sampling experiment. Once the user clicks on the Finish button, the PC sampling experiment window will appear.
The panel/window above is the PC Sampling experiment window. The experiment is ready to run. Note that the Status output area shows that the executable mutatee is loaded.
The process control area provides icons that may be clicked on to control the execution of the experiment. The icon to action translation is as follows:
- => Run the experiment
- =>| Pause the experiment
- |=> Run to Next statement
- \> Step into the next statement
- ! Stop the experiment
The source panel contains the source associated with the loaded application/executable. To run the experiment, click on the right arrow icon, which corresponds to the Run button. Doing this will engage Open|SpeedShop to start the application and to gather the PC Sampling performance data at the sampling rate chosen in the previous step(s).
After splitting, the panels appear side by side. Clicking on the "f3" function results line will focus the source panel on the source file and line number of function "f3", provided the application was compiled with source debugging information such as DWARF. Some compilers do not include source debugging information when invoked at high optimization levels. In that case only the function name may be available, with no source line information, so the click cannot focus the source panel on the selected function.
Clicking on the arrows to the right of the Time or Function header under the PC Stats Panel tab will sort the time or function list in ascending or descending order.
The interactive command line interface tool accepts a number of Open|SpeedShop commands. These commands allow the user to create performance measurement experiments, attach executables, run the experiment to gather performance metric data, and also to display the data to the screen via the text view commands or to launch the GUI to view the performance experiment data.
The interactive command syntax document contains the commands that the Interactive CLI tool accepts as input. A simple scenario illustrating usage is available in both command-only form and command-with-explanation form.
The CLI is bundled into a dynamic library that is loaded on demand. It's the Open|SpeedShop main program that launches the CLI. By default the openss command will launch the GUI upon invocation of the Open|SpeedShop tool. However, the CLI can be started without starting the GUI ($ openss -cli).
Upon invoking Open|SpeedShop ($ openss -cli) the command line is parsed, and if the CLI is requested, the CLI library is loaded and launched. Open|SpeedShop then drops into an event loop for parsing command line events.
When the CLI or GUI is loaded, it looks for CLI and/or GUI plugins in the default directory and in the OPENSS_TOOL_PLUGIN_DIR environment variable path. Each file in the directory is opened and an internal entry point is queried. If found, the plugin manager calls the entry point, initializes any exported menus, brings up the CLI and/or GUI, and then drops into the main event loop(s) waiting for user interaction.
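The directory search described above can be sketched as follows. This only illustrates the lookup order and is not Open|SpeedShop code: the real tool opens each file as a dynamic library and queries an entry point, which this sketch does not attempt. The function names are hypothetical, and treating the environment variable as a list of directories separated by the platform path separator is an assumption.

```python
import os
from pathlib import Path


def plugin_search_dirs(default_dir, env=None, var="OPENSS_TOOL_PLUGIN_DIR"):
    """Return the directories to scan: the default first, then any
    listed in the environment variable (assumed path-separated)."""
    env = os.environ if env is None else env
    dirs = [Path(default_dir)]
    for entry in env.get(var, "").split(os.pathsep):
        if entry:
            dirs.append(Path(entry))
    return dirs


def candidate_plugins(dirs):
    """List every regular file in the search directories. A real plugin
    manager would open each one and look for its entry point."""
    found = []
    for directory in dirs:
        if directory.is_dir():
            found.extend(sorted(p for p in directory.iterdir() if p.is_file()))
    return found


if __name__ == "__main__":
    # /usr/local/lib is where the RPM described earlier installs plugins.
    for path in candidate_plugins(plugin_search_dirs("/usr/local/lib")):
        print(path)
```

Passing the environment in as a parameter keeps the lookup easy to test without changing the process environment.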
After the CLI is launched with openss -cli, the initial window looks like this:
machine.prompt>./openss -cli
At this point, users may enter one of the commands described in Appendix A: Command Syntax. These commands are primarily related to creating, running, and monitoring performance experiments. There are also information commands, which give machine information. A typical usage example of a PC Sampling experiment follows:
[prompt] : openss -cli
# The first user command is to create an experiment. In this example
# it is a PC Sampling experiment (pcsamp) and it will be run on the
# mutatee executable.
openss>>expcreate -f /home/openss/demo/Simple/mutatee pcsamp
# Below a "1" is returned to indicate the experiment number
# The next user command is "expGo" which runs the experiment
# The next three lines are output from the executable's execution
Usage: /home/openss/demo/Simple/mutatee <size>
No size argument given. Defaulting to 250.
/home/openss/demo/Simple/mutatee: successfully completed.
# The next user command tells Open|SpeedShop to print the results of the experiment
# NOTE - the 5 at the end of the stats parameter indicates you want to view the top 5 functions.
# If you had specified 33, as in stats33, you would see the top 33 functions in the performance report.
# The next four lines are the output of the experiment due to the expView command
CPU Time (Seconds) Function
# Open|SpeedShop prompt for additional command input
expcreate -f /home/openss/demo/Simple/mutatee pcsamp
[machine prompt]: openss -batch < batch.input
# Experiment id "1" returned from the experiment create command
# The next three lines are mutatee executable output
Usage: /home/openss/demo/Simple/mutatee <size>
No size argument given. Defaulting to 250.
/home/openss/demo/Simple/mutatee: successfully completed.
# The next four lines are output due to the expView stats5 command
CPU Time (Seconds) Function