
Lustrecomm help love

To: pcp@xxxxxxxxxxx
Subject: Lustrecomm help love
From: Scott Emery <emery@xxxxxxx>
Date: Tue, 15 Jun 2010 23:01:19 -0500 (CDT)
        Let me know if this needs more love...

Scott Emery
emery@xxxxxxx

---
commit d20b823b7a9ffd7a5eb468636e4c3d4536449fd7
Author: Scott Emery <emery@xxxxxxx>
Date:   Tue Jun 15 20:47:10 2010 -0700

    Put something more informative into the Lustrecomm help file.

diff --git a/src/pmdas/lustrecomm/help b/src/pmdas/lustrecomm/help
index 0876863..af3b27b 100644
--- a/src/pmdas/lustrecomm/help
+++ b/src/pmdas/lustrecomm/help
@@ -34,35 +34,73 @@
 # blank lines before the @ line are ignored
 #
 
-@ lustrecomm.timeout
-
-@ lustrecomm.ldlm_timeout
-
-@ lustrecomm.dump_on_timeout
-
-@ lustrecomm.lustre_memused
-
-@ lustrecomm.lnet_memused
-
-@ lustrecomm.stats.msgs_alloc unknown
-
-@ lustrecomm.stats.msgs_max unknown
-
-@ lustrecomm.stats.errors unknown
-
-@ lustrecomm.stats.send_count unknown
-
-@ lustrecomm.stats.recv_count unknown
-
-@ lustrecomm.stats.route_count unknown
-
-@ lustrecomm.stats.drop_count unknown
-
-@ lustrecomm.stats.send_length unknown
-
-@ lustrecomm.stats.recv_length unknown
-
-@ lustrecomm.stats.route_length unknown
-
-@ lustrecomm.stats.drop_length unknown
+@ lustrecomm.timeout  contents of /proc/sys/lustre/timeout
+The time period that a client waits for a server to complete an RPC
+(default in 1.6 is 100s).   Servers wait half this time for a normal
+client RPC to complete and a quarter of this time for a single
+bulk request to complete.  The client pings recoverable targets
+(MDS and OSTs) at one quarter of the timeout, and the server
+waits one and a half times the timeout before evicting a client
+for being "stale".  (source: Lustre 1.6 Operations Manual)
+
+@ lustrecomm.ldlm_timeout contents of /proc/sys/lustre/ldlm_timeout
+This is the time period for which a server will wait for a client
+to reply to an initial AST (lock cancellation request). The default
+is 20s for an OST and 6s for an MDS.  (source: Lustre 1.6 Operations
+Manual)
+
+@ lustrecomm.dump_on_timeout  contents of /proc/sys/lustre/dump_on_timeout
+A 1 triggers dumps of the Lustre debug log when timeouts occur.
+Default value 0.  (source: Lustre 1.6 Operations Manual)
+
+@ lustrecomm.lustre_memused contents of /proc/sys/lustre/memused
+lustre/obdclass/linux/linux-sysctl.c: &proc_memory_alloc
+lustre/include/obd_support.h: obd_memory_sum()
+Total bytes allocated by Lustre (inferred from lustre/include/obd_support.h)
+
+@ lustrecomm.lnet_memused contents of /proc/sys/lnet/memused
+lnet/libcfs/linux/linux-proc.c: (int *)&libcfs_kmemory.counter
+Total bytes allocated by LNET. (inferred from lustre/include/obd_support.h)
+
+@ lustrecomm.stats.msgs_alloc first number from /proc/sys/lnet/stats
+routerstat source file: messages currently allocated (first number after M)
+
+@ lustrecomm.stats.msgs_max second number from /proc/sys/lnet/stats
+routerstat source file: messages maximum (highwater mark) (second
+number after M)
+
+@ lustrecomm.stats.errors third number from /proc/sys/lnet/stats
+routerstat source file: errors (number after E)
+
+@ lustrecomm.stats.send_count fourth number from /proc/sys/lnet/stats
+routerstat source file: send_count (raw data from which second number
+after S is derived).
+
+@ lustrecomm.stats.recv_count fifth number from /proc/sys/lnet/stats
+routerstat source file: recv_count (raw data from which second number
+after R is derived)
+
+@ lustrecomm.stats.route_count sixth number from /proc/sys/lnet/stats
+routerstat source file: route_count (raw data from which second number
+after F is derived)
+
+@ lustrecomm.stats.drop_count seventh number from /proc/sys/lnet/stats
+routerstat source file: drop_count (raw data from which second number
+after D is derived)
+
+@ lustrecomm.stats.send_length eighth number from /proc/sys/lnet/stats
+routerstat source file: send_length (raw data from which first number
+after S is derived)
+
+@ lustrecomm.stats.recv_length ninth number from /proc/sys/lnet/stats
+routerstat source file: recv_length (raw data from which first number
+after R is derived)
+
+@ lustrecomm.stats.route_length tenth number from /proc/sys/lnet/stats
+routerstat source file: route_length (raw data from which first number
+after F is derived)
+
+@ lustrecomm.stats.drop_length eleventh number from /proc/sys/lnet/stats
+routerstat source file: drop_length (raw data from which first number
+after D is derived)
 

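As a rough worked example of the lustrecomm.timeout help text above, the
intervals it describes can be computed from the single timeout value. This is
only a sketch: the function and key names are hypothetical, and it takes the
eviction wait to be one and a half times the timeout.

```python
def derived_intervals(timeout_s: float) -> dict:
    """Intervals implied by the lustrecomm.timeout help text.

    Names are illustrative only; they are not Lustre or PCP identifiers.
    """
    return {
        # Client waits the full timeout for a server to complete an RPC.
        "client_rpc_wait": timeout_s,
        # Server waits half the timeout for a normal client RPC.
        "server_rpc_wait": timeout_s / 2,
        # Server waits a quarter of the timeout for a single bulk request.
        "server_bulk_wait": timeout_s / 4,
        # Client pings recoverable targets at a quarter of the timeout.
        "client_ping_interval": timeout_s / 4,
        # Assumption: server evicts after one and a half times the timeout.
        "server_eviction_wait": timeout_s * 1.5,
    }


if __name__ == "__main__":
    # 100s is the 1.6 default quoted in the help text.
    for name, seconds in derived_intervals(100).items():
        print(f"{name}: {seconds:g}s")
```

With the quoted default of 100s this gives 50s, 25s, 25s and 150s for the
derived values.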
---

Scott Emery
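
For anyone checking the lustrecomm.stats.* descriptions against a live system,
here is a minimal sketch of how the eleven numbers map to metric names. It
assumes /proc/sys/lnet/stats is a single whitespace-separated line in the
order the help text gives; the function name and the sample line are made up
for illustration and this is not the PMDA's actual code.

```python
from typing import Dict

# Field order as described in the help text (first through eleventh number).
FIELDS = (
    "msgs_alloc", "msgs_max", "errors",
    "send_count", "recv_count", "route_count", "drop_count",
    "send_length", "recv_length", "route_length", "drop_length",
)


def parse_lnet_stats(text: str) -> Dict[str, int]:
    """Map the numbers in an lnet stats line to lustrecomm.stats names."""
    values = text.split()
    if len(values) < len(FIELDS):
        raise ValueError(
            f"expected {len(FIELDS)} numbers, got {len(values)}"
        )
    # Any extra trailing values are ignored by zip().
    return {name: int(value) for name, value in zip(FIELDS, values)}


if __name__ == "__main__":
    # Invented sample line; on a real node, read /proc/sys/lnet/stats instead.
    sample = "0 5 0 120 118 0 2 40960 38912 0 512"
    for name, value in parse_lnet_stats(sample).items():
        print(f"lustrecomm.stats.{name} = {value}")
```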
