From: "Guillier, Nicolas" <nicolas.guillier@airbus.com>
Date: Wed, 30 Jun 2004 19:16:41 +0200
Subject: [PCP - Bug?] metric rate higher than 1 !
To: pcp@oss.sgi.com

Hello,

I use PCP-2.2.2-132 to remotely monitor a Linux system. I sometimes face a
strange problem: between two acquisitions, the consumed CPU time is higher
than the real time! Once turned into a percentage, the resulting value can
reach up to 250% of CPU load! This happens for the kernel.cpu.* metrics and
for the disk.all.avactive metric as well (both from the linux pmda).

I need to understand the root causes of such behaviour. Is it due to pmcd or
the pmda on the monitored machine, or to pmcd or pmlogger on the remote
monitor? Can I conclude that the consumption reached a peak, or could it
just be a pmda failure when updating a metric, e.g. while trying to read a
/proc file? Where can I find information about this?

Thank you.

Cordially,
Nicolas GUILLIER
In-Flight and Ground Information Systems
Software Integration - AIRBUS
From: Ken McDonell <kenmcd@melbourne.sgi.com>
Date: Fri, 2 Jul 2004 10:03:48 +1000
Subject: Re: [PCP - Bug?] metric rate higher than 1 !
To: "Guillier, Nicolas"
Cc: pcp@oss.sgi.com

On Wed, 30 Jun 2004, Guillier, Nicolas wrote:

> Hello,
> I use PCP-2.2.2-132 to remotely monitor a Linux system.
> I sometimes face a strange problem: between two acquisitions, the
> consumed CPU time is higher than the real time! Once turned into a
> percentage, the resulting value can reach up to 250% of CPU load!
> This happens for the kernel.cpu.* metrics and for the disk.all.avactive
> metric as well (both from the linux pmda).
>
> I need to understand the root causes of such behaviour.

First, CPU time and disk active time are both really _counters_ in units of
time in the kernel, so reporting a value for such a metric v requires
observations at times t1 and t2; the reported rate (actually time over
time, so a utilization) is

        v(t2) - v(t1)
        -------------
           t2 - t1

The sort of perturbation you report occurs when the collector system
(pmcd + pmdas) is heavily loaded. The collection architecture assigns one
timestamp per fetch, and if the collection system is heavily loaded then
there is some (non-trivial in the extreme case) time window between when
the first value in the fetch is retrieved from the kernel and when the last
is retrieved from the kernel.

Let me try to explain with an example with two counter metrics, x and y,
with correct values as shown below:

        Time    x    y
          0     0    0
          1     1   10
          2     2   20
          3     3   30
          4     4   40
          5     5   50
          6     6   60
          7     7   70
          8     8   80

Now on a lightly loaded system, if we consider samples at t=1, t=4 and t=7,
the fetches would return ([t] is the timestamp):

        Time
          1   pcp client sends fetch request
              pmcd retrieves x=1 and y=10
              pcp client receives { [1] x=1 y=10 }
          4   pcp client sends fetch
              pmcd retrieves x=4 and y=40
              pcp client receives { [4] x=4 y=40 }
          7   pcp client sends fetch
              pmcd retrieves x=7 and y=70
              pcp client receives { [7] x=7 y=70 }

And the reported rates would be correct, namely:

          1   no values available
          4   x=(4-1)/3=1      y=(40-10)/3=10
          7   x=(7-4)/3=1      y=(70-40)/3=10

Now on a heavily loaded system this could happen ...

        Time
          1   pcp client sends fetch request
              pmcd retrieves x=1 and y=10
              pcp client receives { [1] x=1 y=10 }
          4   pcp client sends fetch
              pmcd retrieves x=4
          5   pmcd retrieves y=50                    <-- delay
              pcp client receives { [5] x=4 y=50 }   <-- wrong for x
          7   pcp client sends fetch
              pmcd retrieves x=7 and y=70
              pcp client receives { [7] x=7 y=70 }

And the reported rates would be ...

          1   no values available
          5   x=(4-1)/4=0.75   y=(50-10)/4=10
          7   x=(7-4)/2=1.50   y=(70-50)/2=10

So the delayed fetch at time 4 (which does not return values until time 5)
produces

        x is too _small_ at t=5
        x is too _big_   at t=7

You're noticing the second case. Note that because these are counters, the
effects are self-cancelling and diminish over longer sampling intervals.
There is nothing inherently wrong here.

> Is it due to pmcd or the pmda on the monitored machine, or to pmcd or
> pmlogger on the remote monitor?

The effects are all on the collection (monitored) system.

> Can I conclude that the consumption reached a peak, or could it just be
> a pmda failure when updating a metric, e.g. while trying to read a /proc
> file?

You cannot really conclude either ... it is just the way things work. Now
250% _is_ extreme, but if the system is totally CPU bound then there is no
reason to believe pmcd would not also be impacted.

> Where can I find information about this?

Hopefully this mail explains it. I'll add this to the PCP FAQ on the
oss.sgi.com web site.
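To make the arithmetic above concrete, here is a minimal Python 3 sketch of
the skewed case; the sample points simply mirror Ken's example (the middle
result carries timestamp 5 although x was read at time 4), and the toy data
itself is the only assumption.

    # Rates computed from counter samples when the middle fetch is delayed.
    # Counter x really advances by 1 per unit of time, but the second result
    # is stamped t=5 even though x was read at t=4.
    samples = [
        (1, 1),   # (fetch timestamp, observed value of counter x)
        (5, 4),   # value read at t=4, result timestamped t=5
        (7, 7),
    ]

    for (t1, v1), (t2, v2) in zip(samples, samples[1:]):
        print("interval %d..%d: rate = %.2f" % (t1, t2, (v2 - v1) / (t2 - t1)))

    # Prints 0.75 (too small) then 1.50 (too big); over the whole span the
    # errors cancel: (7 - 1) / (7 - 1) = 1.00.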
From: ti chi <tichi404@yahoo.com>
Date: Wed, 21 Jul 2004 18:01:48 -0700 (PDT)
Subject: scale/coordinate pmie question
To: pcp@oss.sgi.com

We have many web customers monitored with PCP - everything is put on one
master server with a directory for each customer (logger, pmie, etc.), so
the customers are silo'ed, so to speak, into almost 100 directories.

This layout scales well, since we have almost 700 servers monitored. The
problem is that we want to coordinate the pmie alerts/events that occur in
the customer environments with our own environment - basically a global
pmie.

I don't know a good solution, since it is not practical to copy/merge all
the customers' pmlogger files to our system and run pmie on the logs - that
is too much data and takes too long. It is also not practical to have our
pmie or pmlogger query all the customer pmdas - too much bandwidth, and it
slams pmie.

So the silo'ed environment solved the scale problem but left the events
uncoordinated. Any thoughts on a good way to coordinate pmie for the whole
environment? Thanks for any insight!

-ti
From: ti chi <tichi404@yahoo.com>
Date: Thu, 22 Jul 2004 18:10:35 -0700 (PDT)
Subject: Re: scale/coordinate pmie question
To: pcp@oss.sgi.com

I have been thinking about this a lot and have an idea - what about an
"event" pmda that coordinates and forwards pmie exception "events" to the
master server?

When the different customers' pmies hit an exception in their environment,
a pmie "action" could forward the exception message to the event pmda,
which sends it to the master server. The big problem would be getting those
events into the master pmie rule engine - it is not clear how to do that,
but I am looking at the pmie code.

Has anyone tried this? Do any PCP experts have thoughts on this?

-ti

--- ti chi wrote:

> We have many web customers monitored with PCP - everything is put on one
> master server with a directory for each customer (logger, pmie, etc.), so
> the customers are silo'ed, so to speak, into almost 100 directories.
>
> This layout scales well, since we have almost 700 servers monitored. The
> problem is that we want to coordinate the pmie alerts/events that occur in
> the customer environments with our own environment - basically a global
> pmie.
>
> I don't know a good solution, since it is not practical to copy/merge all
> the customers' pmlogger files to our system and run pmie on the logs -
> that is too much data and takes too long. It is also not practical to have
> our pmie or pmlogger query all the customer pmdas - too much bandwidth,
> and it slams pmie.
>
> So the silo'ed environment solved the scale problem but left the events
> uncoordinated. Any thoughts on a good way to coordinate pmie for the whole
> environment? Thanks for any insight!
>
> -ti
From: Kevin Wang <kjw@rightsock.com>
Date: Wed, 21 Jul 2004 20:54:03 -0700
Subject: Re: scale/coordinate pmie question
To: ti chi
Cc: pcp@oss.sgi.com

ti chi wrote:

> We have many web customers monitored with PCP - everything is put on one
> master server with a directory for each customer (logger, pmie, etc.), so
> the customers are silo'ed, so to speak, into almost 100 directories.
>
> This layout scales well, since we have almost 700 servers monitored. The
> problem is that we want to coordinate the pmie alerts/events that occur in
> the customer environments with our own environment - basically a global
> pmie.
>
> I don't know a good solution, since it is not practical to copy/merge all
> the customers' pmlogger files to our system and run pmie on the logs -
> that is too much data and takes too long. It is also not practical to have
> our pmie or pmlogger query all the customer pmdas - too much bandwidth,
> and it slams pmie.
>
> So the silo'ed environment solved the scale problem but left the events
> uncoordinated. Any thoughts on a good way to coordinate pmie for the whole
> environment? Thanks for any insight!

I'd also be generally interested in hearing about any scaling work that's
been done. I'm trying (probably in vain) to get PCP adopted at my company.
We have thousands of hosts per cluster and would need massive scalability.
I'm not sure if it's possible. I may be able to aggregate the subparts of
the cluster (front end, back end, logger) into their own sub-clusters, but
still, thousands is hard.
- Kevin

From: Kevin Wang <kjw@rightsock.com>
Date: Thu, 22 Jul 2004 23:54:15 -0700
Subject: Re: scale/coordinate pmie question
To: ti chi
Cc: pcp@oss.sgi.com

ti chi wrote:

> I have been thinking about this a lot and have an idea - what about an
> "event" pmda that coordinates and forwards pmie exception "events" to the
> master server?
>
> When the different customers' pmies hit an exception in their environment,
> a pmie "action" could forward the exception message to the event pmda,
> which sends it to the master server. The big problem would be getting
> those events into the master pmie rule engine - it is not clear how to do
> that, but I am looking at the pmie code.
>
> Has anyone tried this? Do any PCP experts have thoughts on this?

Hm, that doesn't sound quite right. PCP/pmdas do the data gathering, and
pmie reads the data and makes decisions. What sounds more appropriate is to
write an interrupt-based system on top of that, something like pmie calling
a web page CGI, which would do other types of data consolidation. But that
means writing it from scratch, implementing your own thing, and probably
not ending up with something terribly good.

Now, I know that you can create a consolidation pmda to summarize a bunch
of information, but you'll need to be able to come up with a single number
that can represent the rest of the machines. That unfortunately hides any
actual issue, since all you have is the summary and not an individual
number. It may still be worth doing; that way you can at least record a
"total" number of web requests per minute or something like that (assuming
you're monitoring the web servers with PCP).

Hm, I have to think about this some more. Tiering is obviously needed for
computational overhead, yet PCP doesn't really seem to know how to do that?

- Kevin
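A toy Python 3 illustration of the consolidation trade-off Kevin describes;
the host names and request rates are invented. The rolled-up total is easy
to export as a single number, but it no longer shows which host is in
trouble.

    # Per-host web request rates rolled up into one summary value, as a
    # hypothetical consolidation pmda might export. All figures are made up.
    per_host_rate = {
        "web01": 420.0,   # requests per minute
        "web02": 415.0,
        "web03": 12.0,    # this host clearly has a problem ...
    }

    total = sum(per_host_rate.values())
    print("total requests/min: %.0f" % total)   # ... but the total alone hides it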
From: Ken McDonell <kenmcd@melbourne.sgi.com>
Date: Mon, 26 Jul 2004 10:34:01 +1000 (EST)
Subject: Re: scale/coordinate pmie question
To: ti chi
Cc: pcp@oss.sgi.com

On Wed, 21 Jul 2004, ti chi wrote:

> We have many web customers monitored with PCP - everything is put on one
> master server with a directory for each customer (logger, pmie, etc.), so
> the customers are silo'ed, so to speak, into almost 100 directories.
>
> This layout scales well, since we have almost 700 servers monitored. The
> problem is that we want to coordinate the pmie alerts/events that occur in
> the customer environments with our own environment - basically a global
> pmie.
>
> I don't know a good solution, since it is not practical to copy/merge all
> the customers' pmlogger files to our system and run pmie on the logs -
> that is too much data and takes too long. It is also not practical to have
> our pmie or pmlogger query all the customer pmdas - too much bandwidth,
> and it slams pmie.
>
> So the silo'ed environment solved the scale problem but left the events
> uncoordinated. Any thoughts on a good way to coordinate pmie for the whole
> environment? Thanks for any insight!

I would think you'd want to use as many distributed pmies as makes sense in
terms of network bandwidth and management complexity to filter the data
close to the machines you are monitoring, and then use the pmie alarm
mechanism to forward just the alerts to a central event clearinghouse.

Look at the pmie examples ... disk.00, disk.20 or uag.20 for hints on how
the "shell" action could be used to forward events to your central alert
management system ... of course you'd need a mechanism for sending events
on every managed system, and some way of accumulating and dealing with
those events at the central point ... e-mail is a quick and dirty way of
doing this to prototype the idea ... other similar schemes can be devised
with not much effort.
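As a rough sketch of the sending half of such a scheme (the script name,
port number and rule fragment below are hypothetical, not existing PCP
pieces), a pmie "shell" action on each managed system could invoke a small
Python 3 helper like this to push one alert line to the central
clearinghouse:

    #!/usr/bin/env python3
    # forward_alert.py -- hypothetical helper for a pmie "shell" action, e.g.
    #   <some predicate> -> shell "forward_alert.py collector.example.com 'disk busy on web07'";
    # Sends one timestamped, hostname-tagged alert line to a central collector.
    import socket
    import sys
    import time

    def forward(collector, message, port=5678):
        line = "%s %s %s\n" % (time.strftime("%Y-%m-%dT%H:%M:%S"),
                               socket.gethostname(), message)
        conn = socket.create_connection((collector, port), timeout=10)
        try:
            conn.sendall(line.encode("ascii", "replace"))
        finally:
            conn.close()

    if __name__ == "__main__":
        forward(sys.argv[1], " ".join(sys.argv[2:]))

E-mail, as Ken notes, would do just as well for a prototype; the point is
only that each alert leaves its silo as a single line of text.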
From: Ken McDonell <kenmcd@melbourne.sgi.com>
Date: Mon, 26 Jul 2004 10:39:54 +1000 (EST)
Subject: Re: scale/coordinate pmie question
To: ti chi
Cc: pcp@oss.sgi.com

On Thu, 22 Jul 2004, ti chi wrote:

> I have been thinking about this a lot and have an idea - what about an
> "event" pmda that coordinates and forwards pmie exception "events" to the
> master server?
>
> When the different customers' pmies hit an exception in their environment,
> a pmie "action" could forward the exception message to the event pmda,
> which sends it to the master server. The big problem would be getting
> those events into the master pmie rule engine - it is not clear how to do
> that, but I am looking at the pmie code.
>
> Has anyone tried this? Do any PCP experts have thoughts on this?

You've suggested something very similar to my initial response, although
couched 100% in the PCP protocol framework. One potential difficulty with
using a nested PMDA like this is that the pmcd-pmda protocols are
inherently synchronous, so you'd need to make the "event" pmda
multi-threaded (or multi-process), with one side communicating with pmcd
and the other listening for events from the remote pmies.

Using smtp and a simple mail filter might be a quicker way to cobble this
together ... you could simply append the mail to a log file at the central
location and modify the weblog PMDA to export information about the
incoming alerts, if you don't have some other form of event handling
framework at the central system.
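The receiving half could follow the split Ken describes: one side stays
free for the synchronous pmcd conversation while a second thread listens
for forwarded alerts. The Python 3 sketch below shows only that structure;
the real PMDA plumbing is omitted and the port number is arbitrary.

    # Structural sketch of a hypothetical "event" PMDA: a background thread
    # collects alert lines from remote pmies while the main side remains
    # available to answer pmcd (here just a polling loop as a stand-in).
    import socketserver
    import threading
    import time

    events = []                        # alert lines received so far
    events_lock = threading.Lock()

    class AlertHandler(socketserver.StreamRequestHandler):
        def handle(self):
            for raw in self.rfile:     # one alert per line
                with events_lock:
                    events.append(raw.decode("ascii", "replace").rstrip())

    # side 2: listen for events forwarded by remote pmie actions
    server = socketserver.ThreadingTCPServer(("", 5678), AlertHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()

    # side 1: the synchronous pmcd protocol would be serviced here; on each
    # fetch the PMDA could export len(events) and the latest message.
    while True:
        with events_lock:
            print("alerts received:", len(events))
        time.sleep(30)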
From: "Zhao, Forrest" <forrest.zhao@intel.com>
Date: Tue, 27 Jul 2004 14:20:27 +0800
Subject: ftp://oss.sgi.com is not accessible
To: pcp@oss.sgi.com

Hi,

I tried to download the latest PCP from
ftp://oss.sgi.com/projects/pcp/download.

But the URL is not accessible, and the log of my ftp client is:

COMMAND:> USER anonymous@oss.sgi.com
 230-(----GATEWAY CONNECTED TO oss.sgi.com----)
 230-(220---------- Welcome to Pure-FTPd ----------)
 230-(220-You are user number 4 of 50 allowed.)
 230-(220-Local time is now 23:16. Server port: 21.)
 230-(220 You will be disconnected after 15 minutes of inactivity.)
 230-(230-)
 230-(230-)
 230-(230--------------------------------------------------------)
 230-(230-Welcome to the SGI open source repository.)
 230-(230-web, ftp and cvs roots are shared for your convenience)
 230-(230-Thanks for visiting SGI, may the source be with you.)
 230-(230--------------------------------------------------------)
 230-(230-)
 230-(230-Note that, as provided in the License, the Software is distributed on an)
 230-(230-"AS IS" basis, with ALL EXPRESS AND IMPLIED WARRANTIES AND CONDITIONS)
 230-(230-DISCLAIMED, INCLUDING, WITHOUT LIMITATION, ANY IMPLIED WARRANTIES AND)
 230-(230-CONDITIONS OF MERCHANTABILITY, SATISFACTORY QUALITY, FITNESS FOR A)
 230-(230-PARTICULAR PURPOSE, AND NON-INFRINGEMENT.)
 230-(230-)
 230-(230-All transfers are logged with your host name and whatever you)
 230-(230-entered for the (FTP) password. If you're unwilling to abide by this)
 230-(230-policy, please disconnect now.)
 230-(230-)
 230-(230-Due to U.S. Exports Regulations, all cryptographic software on this)
 230-(230-site is subject to the following legal notice:)
 230-(230-)
 230-(230- This site includes publicly available encryption source code)
 230-(230- which, together with object code resulting from the compiling of)
 230-(230- publicly available source code, may be exported from the United)
 230-(230- States under License Exception "TSU" pursuant to 15 C.F.R. Section)
 230-(230- 740.13(e).)
 230-(230-)
 230-(230-)
 230-(230-This legal notice applies to cryptographic software only. Please see)
 230-(230-the Bureau of Export Administration (http://www.bxa.doc.gov/) for more)
 230-(230-information about current U.S. regulations.)
 230-(230-)
 230-(230-Note:)
 230-(230-If you get a "garbage looking" file when trying to download)
 230-(230-try to use SHIFT while clicking on the file name. It'll force)
 230-(230-a "Save as File" in Netscape. Also, it is possible that when)
 230-(230-you click on a file with the "rpm" extension, Netscape will)
 230-(230-launch Real Audio. In this case, you want to use the shift)
 230-(230-trick to download the file as well.)
 230 Anonymous user logged in
STATUS:> Login successful.
STATUS:> Waiting 30 seconds...
STATUS:> ============== Attempt #1 ==============
COMMAND:> USER anonymous@oss.sgi.com
 230 Anonymous user logged in
STATUS:> Login successful.
STATUS:> Waiting 30 seconds...
STATUS:> ============== Attempt #2 ==============
COMMAND:> USER anonymous@oss.sgi.com
 230 Anonymous user logged in
STATUS:> Login successful.
STATUS:> Waiting 30 seconds...
STATUS:> ============== Attempt #3 ==============
COMMAND:> USER anonymous@oss.sgi.com
 230 Anonymous user logged in
STATUS:> Login successful.
STATUS:> Waiting 30 seconds...

Could anyone tell me how to resolve this problem?

Thanks,
Forrest
From: Keith Owens <kaos@sgi.com>
Date: Tue, 27 Jul 2004 16:43:25 +1000
Subject: Re: ftp://oss.sgi.com is not accessible
To: "Zhao, Forrest"
Cc: pcp@oss.sgi.com

On Tue, 27 Jul 2004 14:20:27 +0800, "Zhao, Forrest" wrote:

>Hi,
>
>I tried to download the latest PCP from
>ftp://oss.sgi.com/projects/pcp/download.
>
>But the URL is not accessible, and the log of my ftp client is:
>
>COMMAND:> USER anonymous@oss.sgi.com
> 230-(----GATEWAY CONNECTED TO oss.sgi.com----)
> 230-(220---------- Welcome to Pure-FTPd ----------)
> 230-(220-You are user number 4 of 50 allowed.)
> 230-(220-Local time is now 23:16. Server port: 21.)
> 230-(220 You will be disconnected after 15 minutes of inactivity.)
> 230-(230-)
>....
> 230-(230-trick to download the file as well.)
> 230 Anonymous user logged in
>STATUS:> Login successful.
>STATUS:> Waiting 30 seconds...
>STATUS:> ============== Attempt #1 ==============
>COMMAND:> USER anonymous@oss.sgi.com
> 230 Anonymous user logged in

Looks like a problem with your ftp client. It connected to oss.sgi.com, got
the full response (i.e. this is not a path MTU problem) and successfully
logged in as anonymous (a 230 response was returned). Then your client
timed out and kept resending the userid. Your ftp client is expecting
something that is not being sent by oss.

Which ftp client are you using, and can you try another client? Linux ftp
and ncftp work fine when talking to oss.sgi.com.
From: "Zhao, Forrest" <forrest.zhao@intel.com>
Date: Tue, 27 Jul 2004 15:24:53 +0800
Subject: RE: ftp://oss.sgi.com is not accessible
To: Keith Owens
Cc: pcp@oss.sgi.com

Hi, Keith

Thank you. After changing to a new ftp client, I got the tar package :)

Best wishes,
Forrest

-----Original Message-----
From: Keith Owens [mailto:kaos@sgi.com]
Sent: Tuesday, July 27, 2004 2:43 PM
To: Zhao, Forrest
Cc: pcp@oss.sgi.com
Subject: Re: ftp://oss.sgi.com is not accessible

On Tue, 27 Jul 2004 14:20:27 +0800, "Zhao, Forrest" wrote:

>Hi,
>
>I tried to download the latest PCP from
>ftp://oss.sgi.com/projects/pcp/download.
>
>But the URL is not accessible, and the log of my ftp client is:
>
>COMMAND:> USER anonymous@oss.sgi.com
> 230-(----GATEWAY CONNECTED TO oss.sgi.com----)
> 230-(220---------- Welcome to Pure-FTPd ----------)
> 230-(220-You are user number 4 of 50 allowed.)
> 230-(220-Local time is now 23:16. Server port: 21.)
> 230-(220 You will be disconnected after 15 minutes of inactivity.)
> 230-(230-)
>....
> 230-(230-trick to download the file as well.)
> 230 Anonymous user logged in
>STATUS:> Login successful.
>STATUS:> Waiting 30 seconds...
>STATUS:> ============== Attempt #1 ==============
>COMMAND:> USER anonymous@oss.sgi.com
> 230 Anonymous user logged in

Looks like a problem with your ftp client. It connected to oss.sgi.com, got
the full response (i.e. this is not a path MTU problem) and successfully
logged in as anonymous (a 230 response was returned). Then your client
timed out and kept resending the userid. Your ftp client is expecting
something that is not being sent by oss.

Which ftp client are you using, and can you try another client? Linux ftp
and ncftp work fine when talking to oss.sgi.com.
From: Ken McDonell <kenmcd@melbourne.sgi.com>
Date: Sat, 31 Jul 2004 08:25:25 +1000 (EST)
Subject: Re: Query on cluster measurement
To: Mark_H_Johnson@Raytheon.com
Cc: pcp@oss.sgi.com

This is in respect of some very old mail, but please read on ...

I've been working on a new PCP component (pmproxy) that can be used as a
surrogate pmcd. In Mark's picture below, pmproxy would be deployed on the
Head Node and would forward requests from the Workstations on to the
Compute Nodes; responses from the compute nodes are returned to the
Workstations. Some libpcp changes use the PMPROXY_HOST and PMPROXY_PORT
environment variables to change the behaviour of pmNewContext() so that it
connects to pmproxy rather than to pmcd on the Compute Nodes.

This code is now working, and I'd like to enlist any volunteers who'd be
willing to try it out before we include it in the mainstream PCP releases.
If you're interested, please contact me directly.

On Fri, 3 Aug 2001, Mark_H_Johnson@Raytheon.com wrote:

> We are looking at using PCP for measuring information on our cluster of
> PCs and have a few questions...
>
> To set the stage, our network looks something like...
>
>        Workstation(s)
>         |  |  |  |
>       ---------+----------
>                |
>            Head Node
>                |
>      Switch (private LAN)
>                |
>       ---------+----------
>         |  |  |  |
>        Compute Nodes
>         |  |  |  |
>       Other Equipment
>
> The head node is NOT a router - workstations can't see the compute nodes
> (nor the other equipment) with TCP/IP.
>
> We would prefer to run the monitoring tools on one or more workstations.
> We would prefer to run the agents on both the compute nodes and the head
> node. We would prefer to collect the data at the head node for
> distribution to the workstations. [I think I got the terminology right...]
> All the machines are running Linux, and we have PCP 2.2.1 downloaded and
> installed on all of the machines that will be doing this.
>
> (1) In a few places, the documentation says that the collector works with
> local agents. But the man page for pmcd(1) indicates that socket
> connections are supported. Is there some way we can gather key data items
> from the compute nodes, send them to the head node [socket connection?]
> and include them in the head node's name space? If not, do you have
> suggestions for implementing such a capability?
>
> (2) In lieu of an elegant solution to (1) - could we use remote shell to
> the compute nodes, use pminfo to dump the data, and import it with the
> ASCII interface to pmcd?
>
> (3) We want to measure data transfer rates to the other equipment. We were
> looking at getting data out of /proc, but we have function interfaces
> available as well. Should we just filter the /proc output, similar to what
> the Linux agent does, or use code instead?
>
> (4) Was there additional work done in ACE (Advanced Cluster Environment)
> that may have implemented this already? If so, who should we contact at
> SGI for more information?
>
> Thanks.
> --Mark H Johnson
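As an illustration of how a workstation might use this (the host names are
placeholders, and it assumes only the PMPROXY_HOST/PMPROXY_PORT behaviour
Ken describes), an ordinary PCP client can simply be run with the proxy
named in its environment:

    # Python 3 sketch: point libpcp clients at pmproxy on the head node, then
    # query a compute node the workstation cannot reach directly.
    import os
    import subprocess

    env = dict(os.environ)
    env["PMPROXY_HOST"] = "head-node.example.com"   # where pmproxy runs
    # env["PMPROXY_PORT"] = "..."                   # only for a non-default port

    # pmNewContext() inside pminfo should now go via the proxy rather than
    # straight to pmcd on the compute node.
    subprocess.call(["pminfo", "-f", "-h", "compute07", "kernel.all.load"],
                    env=env)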