June 2001
This guide provides information for understanding and using the Collect data collection tool and its related utilities, collgui and cfilt. It is designed primarily for system administrators.
© 2000 Compaq Computer Corporation
Compaq, the Compaq logo, and the Digital logo are registered in the U.S. Patent and Trademark Office. Alpha, AlphaServer, NonStop, TruCluster, and Tru64 are trademarks of Compaq Computer Corporation.
Microsoft and Windows NT are registered trademarks of Microsoft Corporation. Intel, Pentium, and Intel Inside are registered trademarks of Intel Corporation. UNIX is a registered trademark and The Open Group is a trademark of The Open Group in the United States and other countries. Other product names mentioned herein may be the trademarks of their respective companies.
Possession, use, or copying of the software described in this publication is authorized only pursuant to a valid written license from Compaq Computer Corporation or an authorized sublicensor.
Compaq Computer Corporation shall not be liable for technical or editorial errors or omissions contained herein. The information in this document is subject to change without notice.
Collect is a tool that collects operating system data under Compaq Tru64 UNIX Versions 4.x and 5.x. Collect is designed for high reliability and low system-resource overhead, and can be run in either interactive mode or historical mode.
Collect's tightly integrated associated tools, collgui and cfilt, provide filtering and display capability for collected data.
The cfilt utility allows the arbitrary selection of values from the output of Collect. It condenses the output of Collect into one line per sample, or one line per a given number of samples when the averaging option is used. Data in this form can then be graphed using gnuplot or Excel.
cfilt can also be used live, that is, as a filter on Collect while it is collecting and writing to standard output. This works only if no normalization is requested, because normalization requires that all samples be seen before cfilt can determine the highest value, which is then used to normalize.
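Why normalization rules out live filtering can be shown with a short sketch (illustrative Python; cfilt itself is a Perl script and this is not its actual code). Scaling a series so its highest value maps to 100 requires the whole series up front:

```python
def normalize(samples, scale=100.0):
    """Scale samples so the largest value maps to `scale`.

    Requires the complete list: the maximum is unknown until every
    sample has been seen, which is why cfilt cannot normalize while
    filtering a live stream.
    """
    peak = max(samples)
    if peak == 0:
        return [0.0 for _ in samples]
    return [s * scale / peak for s in samples]

print(normalize([5, 7, 9]))  # the largest value (9) maps to 100.0
```

A streaming filter sees one sample at a time, so it can pass values through or sum them, but it cannot apply this scaling until collection ends.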
The collgui utility is intended to help evaluate data gathered by collect. It operates as a coordinator among collect, cfilt, and gnuplot: collgui automates the extraction of information from a binary data file written by Collect and directs it to gnuplot to produce a graphical rendition of the data.
The following table lists the main features and benefits of Collect.
This guide uses the following conventions:
The collect utility is a system monitoring tool that records or displays specific operating system and process data for a set of subsystems. Any set of subsystems, such as file system, message queue, tty, or header, can be included in or excluded from data collection. Data can either be displayed on the terminal or stored in a compressed or uncompressed data file. Data files can be read and manipulated from the command line or through command scripts.
To ensure that the collect utility delivers reliable statistics, it locks itself into memory using the page-locking function plock() and, by default, cannot be swapped out by the system. It also raises its priority using the priority function nice(). These measures should have no impact on a system under normal load, and only a minimal impact on a system under extremely high load. If required, page locking can be disabled using the -ol command option, and the collect utility's priority setting can be disabled using the -on command option.
Some collect operations use kernel data that is accessible only to root. Because system administration practice should not involve lengthy operations as root, collect is installed with permissions set to 04750. This setting allows group members (typically the system group) to run collect with owner setuid permissions. If this is inappropriate in your environment, you can reset the permissions to fit your needs.
You can configure collect to start automatically when the system is rebooted. This is particularly useful for continuous monitoring. To do this, use the rcmgr command with the set operation to configure the following values in /etc/rc.config*:
cariad >rcmgr set COLLECT_AUTORUN 1
A value of 1 sets collect to automatically start on reboot. A value of 0 (the default) causes collect to not start on reboot.
cariad >rcmgr set COLLECT_ARGS ""
A null value causes collect to start with the following default values (command options):
-i60,120 -f /var/adm/collect.dated/collect -H d0:5,1w -W 1h -M 10,15
cariad >rcmgr set COLLECT_COMPRESSION 1
A value of 1 sets compression on. A value of 0 sets compression off.
See the rcmgr(8) reference page for more information.
The collect utility can read multiple binary data files using the -p option and play them back as one stream, with monotonically increasing sample numbers. It is also possible to combine multiple binary input files into one binary output file, by using the -p option with the input files and the -f option with the output file. Note that the collect utility will combine input files in whatever order you specify on the command line. This means that the input files must be in strict chronological order if you want to do further processing of the combined output file. You can also combine binary input files from different systems, made at different times, with differing subsets of subsystems for which data has been collected. Filtering options such as -e, -s, -P, and -D can be used with this function.
Where appropriate, data is presented in units per second. For example, disk data such as kilobytes transferred, or the number of transfers, is always normalized to a one-second interval, no matter what collection interval is chosen. The same is true for the following data items:
· CPU interrupts, system calls, and context switches.
· Memory pages out, pages in, pages zeroed, pages reactivated, and pages copied on write.
· Network packets in, packets out, and collisions.
· Process user and system time consumed.
Other data is recorded as a snapshot value. Examples of this are: free memory pages, CPU states, disk queue lengths, and process memory.
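The per-second normalization described above amounts to dividing a raw counter delta by the sample interval, so the reported rate is independent of the interval chosen. A minimal sketch (illustrative Python, not collect's actual implementation):

```python
def per_second(counter_start, counter_end, interval_seconds):
    """Convert a raw counter delta into a per-second rate.

    The same transfer volume yields the same rate regardless of how
    long the sampling interval was, which is what makes samples taken
    at different intervals comparable.
    """
    return (counter_end - counter_start) / interval_seconds

print(per_second(0, 3000, 60))  # 3000 KB over 60 s -> 50.0 KB/sec
print(per_second(0, 500, 10))   # 500 KB over 10 s  -> 50.0 KB/sec
```

Snapshot values such as free memory pages or disk queue lengths skip this step, since they are instantaneous readings rather than accumulating counters.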
A collection interval can be specified using the -i option followed by an integer, optionally followed (without spaces) by a comma or colon and a second integer. If the optional second integer is given, it is a separate time interval that applies only to the process subsystem. The process interval must be a multiple of the regular interval. Collecting process information is more taxing on system resources than collecting data for the other subsystems and is generally not needed at the same frequency. Process data also takes up the most space in the binary data file. Generally, specifying a process interval greater than 1 will significantly decrease the load the collector places on the system being monitored.
Use the -S (sort) and -nX (number) options to sort processes by percentage of CPU usage and to save only the top X processes. Target specific processes using the -Plist option, where list is a comma-separated list of process identifiers without blanks.
If many disks (more than 100) are connected to the system being monitored, use the -D option to monitor a particular set of disks.
The collect utility reads and writes gzip-format compressed datafiles. Compressed output is enabled by default but can be disabled using the -oz command option. Unless you specify the -oz command option, the extension .cgz is appended to the output filename. Older, uncompressed datafiles can be compressed using gzip, and the resulting files can be read by collect in their compressed form.
Compression during collection should not generate any additional CPU load. Because compression uses buffers and therefore does not write to disk after every sample, it makes fewer system calls, and its overall impact is negligible. However, because the output is buffered, there is one possible drawback: if collect terminates abnormally (perhaps due to a system crash), more data samples will be lost than if compression were not used. This should not be an important consideration for most users, as you can specify how often data is written to disk.
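Because the compressed datafile uses standard gzip framing, it can be produced and inspected with ordinary tools. A minimal illustrative sketch in Python (the filename and record text are hypothetical, chosen only to demonstrate the round trip):

```python
import gzip

# Write a few text samples through a gzip stream. Output is buffered
# and flushed in blocks rather than after every sample, which is why
# an abnormal termination can lose the most recent samples.
with gzip.open("collect_sample.cgz", "wt") as out:
    for n in range(3):
        out.write("#### RECORD %d ####\n" % (n + 1))

# The same file reads back transparently through the gzip layer.
with gzip.open("collect_sample.cgz", "rt") as inp:
    records = inp.read().splitlines()

print(len(records))  # 3
```

The same property is what allows older uncompressed datafiles, once run through gzip, to be read back by collect in compressed form.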
You can select samples from the total period over which data collection ran. Use the -C option to specify a start time and, optionally, an end time. The format is as follows:
[+]Year:Month:Day:Hour:Minute:Second.
The plus sign (+) indicates that the time should be interpreted as relative to the beginning of the collection period. If any fields are excluded from the string, the corresponding values from the start time are used in their place; the time value is parsed from right to left, so the first field is interpreted as Second, the second field (if there is one) as Minute, and so on. For example, if the collection period is from October 21, 1999, 16:44:03 to October 21, 1999, 16:54:55, all but minutes and seconds can be omitted from the command option: -C46:00,47:00 (from 16:46:00 to 16:47:00). However, if the collection ran overnight, it is necessary to specify the day as well, for example when the period is Oct 21 16:44 to Oct 22 9:30 and the desired range runs from 23:00 on the first day to 1:00 on the second.
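The right-to-left field filling can be sketched as follows (illustrative Python; a hypothetical helper, not collect's actual parser):

```python
def resolve_time_spec(spec, start):
    """Fill omitted fields of a -C style time spec from the start time.

    `spec` is a colon-separated string, parsed right to left as
    ...:Hour:Minute:Second; `start` is a full six-field tuple
    (year, month, day, hour, minute, second).
    """
    given = [int(f) for f in spec.split(":")]
    fields = list(start)
    # The rightmost spec field is Second, the next is Minute, and so
    # on; any leading fields not supplied keep the start-time values.
    fields[len(fields) - len(given):] = given
    return tuple(fields)

start = (1999, 10, 21, 16, 44, 3)
print(resolve_time_spec("46:00", start))  # (1999, 10, 21, 16, 46, 0)
```

With the overnight example, a spec that includes the day, such as "21:23:00:00", would resolve the remaining year and month from the start time in the same way.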
The following command options are useful:
If you want simultaneous text (ASCII) output to the screen while collecting to a file, use the -a option.
The -t option prefixes each data line with a unique tag, making it easier for your scripts to find and extract data. Tags are superfluous if you use the Perl script cfilt.
The -T option turns off collection for all subsystems except disk and displays only a total MB/sec across all disks in the system. Use the -s option with the -T option to override this behavior and collect data for other subsystems.
The -R option causes collect to terminate after a specified amount of time.
All flags that can reasonably be applied to both collection and playback will work. The -Plist filter option used during collection will collect data only for the processes you specify. During playback it will only display data for the corresponding processes. To save space in the binary data file, you can limit your collection to specific processes, specific disks, or specific subsystems. However, if you want to look at volumes of data and select different chunks at a time, you should collect everything and later use the filter options to select data items during playback.
Note that under certain circumstances the disk statistics may be only approximate. Provided you use the latest collect versions and operating system patches, data is presented for all statistics except %BSY, which is reported as zero. In this release, ACTQ and WTQ are accurate. In older releases of collect, some data fields were zero and data in some fields could be inaccurate under certain circumstances.
In this release, collect automatically reads older datafile versions when playing back files.
You can convert a datafile created by an older version of collect to the current version by using the -p collect_datafile option together with the -f file option. During conversion you can use most command options to extract specific data from the input collect_datafile. For example:
· Use the -s and -e options to select data only from particular subsystems.
· Use the -nX and -S options to take only X processes and sort them by CPU usage.
· Use the -D option to select disks and the -L option to select LSM volumes.
· Use the -P, -PC, -PU, -PP options to select processes based on their identifiers.
· Use the -C option to extract data according to specified start and stop times.
With the filtering capability of cfilt you can fine-tune Collect output to provide the specific values you want to focus on, without the clutter of all the system information. The utility provides a number of options for selection, averaging, and choosing between real-time and historical operation modes.
The following options are available for cfilt:
Expressions are also accepted as options.
An expression has the following syntax:
<subsystem>:<selection-criterion>:<tag-expr1>:<tag-expr2>:<...>:<tag-exprN>
A subsystem can be one of: proc, disk, mem, net, cpu, sin, file, tty, and lsm (only the first three characters are significant).
If a plus sign (+) is appended to the subsystem name, or no selection criterion has been given, numerical values are summed over all lines of a subsystem. If a selection criterion has been provided and there is no plus sign at the end of the subsystem name, then for each value in the selection criterion, the corresponding values for each <tag-expr> are printed. For example, given the following output from Collect:
# DISK Statistics
#DSK NAME B/T/L R/S RKB/S W/S AVS QLEN %BUSY
   0  rz1 0/1/0   5   300  10  10    0    70
   1  rz2 0/2/0   7   400  11  10    0    80
   2  rz3 0/3/0   9   500  12  10    0    90
Assuming that cfilt is called with the single following expression,
disk:r/s
cfilt would sum reads/second for all disks. That is, 5+7+9=21. The output of cfilt would be:
<epoch-seconds> <sample#> 21
The expression disk+:name=rz1,rz2:r/s would sum reads/second for disks rz1 and rz2, 5+7=12. (name=rz1,rz2 is a selection-criterion, which is discussed below.) The output of cfilt would be:
<epoch-seconds> <sample#> 12
The expression disk+:name=rz1,rz2:rkb/s+wkb/s would sum kilobytes read and written for disks rz1 and rz2, 300+400+1000+2000=3700, as follows (rkb/s+wkb/s is a tag expression, which is discussed below):
<epoch-seconds> <sample#> 3700
The expression disk:name=rz1,rz2:r/s would print reads/second for rz1 and reads/second for rz2, as follows:
<epoch-seconds> <sample#> 5 7
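The summing and selection behavior in the examples above can be mimicked with a short sketch (illustrative Python over the sample disk table; cfilt itself is a Perl script and this is not its code):

```python
# One dict per disk line, using tags from the sample output above.
disks = [
    {"name": "rz1", "r/s": 5},
    {"name": "rz2", "r/s": 7},
    {"name": "rz3", "r/s": 9},
]

def extract(lines, tag, names=None, summed=False):
    """Select lines by name (if given) and either sum or list `tag`.

    `summed=True` corresponds to the trailing plus sign (or to giving
    no selection criterion at all); `summed=False` prints one value
    per selected line.
    """
    chosen = [l for l in lines if names is None or l["name"] in names]
    values = [l[tag] for l in chosen]
    return sum(values) if summed else values

print(extract(disks, "r/s", summed=True))            # disk:r/s -> 21
print(extract(disks, "r/s", names={"rz1", "rz2"},
              summed=True))                          # disk+:name=rz1,rz2:r/s -> 12
print(extract(disks, "r/s", names={"rz1", "rz2"}))   # disk:name=rz1,rz2:r/s -> [5, 7]
```

Each result would appear after the usual <epoch-seconds> <sample#> columns in cfilt's actual output.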
A selection criterion consists of a field tag (see tag-expr) on the left of an equals sign and a comma-separated list of values of that field that should be selected.
<tag>=<value>[,<value>[,<value>[...]]]
Examples: pid=1234,1235,8888; command=init; name=rz0,rz1
Tags are the column labels used by Collect. For example, in the disk subsystem, the tags are dsk, name, b/t/l, r/s, and so on. A tag-expr can be anything from a complicated arithmetic expression to simply the name of a collect output field, such as rss.
tag1+tag2          add values tag1 and tag2
tag1-tag2          subtract
tag1*tag2          multiply
tag1~tag2          divide tag1 by tag2
log(tag1)          functions
(100-tag1)~tag2    constants and grouping
If a percent sign (%) is appended to the <tag-expr>, all values are normalized to 100; if an integer follows the percent sign, that integer is used instead of 100. This is useful for graphing results simultaneously.
The available functions are: cos, sin, tan, sqrt, log, exp, abs, atan2, int, and convtime, which converts Minutes:Seconds.TenthsHundredths to Seconds.TenthsHundredths.
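The convtime conversion can be sketched as follows (illustrative Python; cfilt's own implementation is in Perl):

```python
def convtime(value):
    """Convert a Minutes:Seconds.Hundredths string to plain seconds.

    For example, a process time of "2:03.50" becomes 123.5 seconds,
    which is easier to graph than the mixed-unit form.
    """
    minutes, seconds = value.split(":")
    return int(minutes) * 60 + float(seconds)

print(convtime("2:03.50"))  # 123.5
```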
If a subsystem has multiple lines per sample, values are added for all lines in one record that match (if no selection criterion is given, all lines are taken).
No white space is allowed in expressions.
If you use parentheses for grouping or functions, be sure to surround the expression in single quotes.
You can give only one expression per subsystem to cfilt. This means, among other things, that you cannot have both summed data for a subsystem and individual graphs for all the items you summed.
The first two columns of output are always <epoch-seconds> <sample#>.
The first two columns in cfilt's output are always the epoch-seconds and the sample number. The epoch-second is the internal UNIX time format: the number of seconds since the beginning of the epoch, January 1, 1970. This is extracted directly from the Collect output. At the beginning of each record there is a line similar to the following:
#### RECORD 1 (873230968:160) (Tue Sep 2 22:09:28 1997) ####
In this example, epoch-seconds is 873230968. The sample number is also extracted from this line; in this example it is 1.
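Extracting both values from a record header can be sketched with a regular expression (illustrative Python; the header format is taken from the example above, and the helper name is hypothetical):

```python
import re

# "#### RECORD <sample#> (<epoch-seconds>:<...>) (<date>) ####"
RECORD_RE = re.compile(r"^#### RECORD (\d+) \((\d+):\d+\)")

def parse_record_header(line):
    """Return (sample_number, epoch_seconds) from a record header."""
    m = RECORD_RE.match(line)
    if m is None:
        raise ValueError("not a record header: " + line)
    return int(m.group(1)), int(m.group(2))

header = "#### RECORD 1 (873230968:160) (Tue Sep 2 22:09:28 1997) ####"
print(parse_record_header(header))  # (1, 873230968)
```

A script post-processing Collect text output could use the same pattern to align samples with wall-clock time.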
The following are examples of cfilt usage:
cfilt -fdata.in cpu:user+nice:intr%:sysc%:cs%
Output: <seconds> <sample#> <user+nice> <interrupts> <syscalls> <conswitch> (where interrupts, syscalls, and conswitch are normalized)
cfilt -fdata.in proc+:user=urban:rss
Output: <seconds> <sample#> <RSS (resident set size) for all processes owned by urban>
cfilt -fdata.in cpu:idle net:inpkt+outpkt% mem:free%
Output: <seconds> <sample#> <cpu:idle> <net:inpkt+outpkt(normalized)> <mem:free(normalized)>
cfilt -fdata.in pro:pid=1234,8888:rss:vsz
Output: <seconds> <sample#> <rss(pid=1234)> <vsz(pid=1234)> <rss(pid=8888)> <vsz(pid=8888)>
cfilt -fdata.in pro+:pid=1234,8888:rss:vsz
Output: <seconds> <sample#> <rss(sum for pid 1234,8888)> <vsz(sum)>
cfilt -fdata.in pro+:rss:vsz
Output: <seconds> <sample#> <rss(sum all procs)> <vsz(sum for all procs)>
The collgui utility relies on cfilt to evaluate data gathered by Collect. Therefore, understanding cfilt, especially if you want to do complicated or nonstandard operations, will help you get the most out of collgui.
The data extraction performed by collgui is robust but not quick. Processing the enormous amount of data that Collect can generate takes some time, so collgui offers two different methods for selecting samples for graphing.
When you save a user-defined setting/configuration, a unique ID consisting of the filename (without path) plus the file size is saved with it. When you recall the setting, if the unique ID of your currently open data file matches the saved one, items such as START:, END:, X-range, Y-range, average samples, X-units, and samples with process data are also restored. If the unique IDs do not match, only the subsystem settings are restored.
The mechanism by which one of many objects (such as LSM volumes, disks, tapes, or single CPUs) is selected is somewhat unusual. If there are fewer than a fixed number of objects (about 30), a MenuButton is created (when Add is pressed, a vertical list is presented). If the number is greater than this constant, a separate window is created with a listbox containing all possible objects. Double-clicking an object in the listbox adds it to the selection listbox.
The selection mechanism for processes deserves special mention: it is always a separate window with a listbox, a slider marked sample, and a button marked List Processes next to it. Using the slider, a sample (record) can be selected from the collection period, and double-clicking a process enters its PID in the selection listbox. Processes are selected by their PIDs. At the top of each column is a button that turns red when the mouse is over it; pressing the button sorts the list by the values in that column.
This is a description of the main window of collgui, from top to bottom:
FILE indicator, START: and END:
The following example serves as a quick guide for those who don't need to learn cfilt first.
collgui relies on the default colors of Tk, which are usually fine. However, under CDE there are problems. If you have trouble seeing text in the entry widgets, try putting the following lines in your ~/.Xdefaults file:
Collgui*foreground: black
Collgui*background: white
Merge this change into your in-memory resource database with this command:
xrdb -merge ~/.Xdefaults