
Re: [pcp] [RFC] Collect additional metrics from /proc/net/netstat

To: Nathan Scott <nathans@xxxxxxxxxx>
Subject: Re: [pcp] [RFC] Collect additional metrics from /proc/net/netstat
From: Michele Baldessari <michele@xxxxxxxxxx>
Date: Thu, 15 May 2014 13:42:13 +0100
Cc: pcp@xxxxxxxxxxx
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <724194126.7674818.1400136552701.JavaMail.zimbra@xxxxxxxxxx>
References: <1400111497-2534-1-git-send-email-michele@xxxxxxxxxx> <724194126.7674818.1400136552701.JavaMail.zimbra@xxxxxxxxxx>
User-agent: Mutt/1.5.21 (2012-12-30)
Hi Nathan,

On Thu, May 15, 2014 at 02:49:12AM -0400, Nathan Scott wrote:
> > this patch mimics src/pmdas/linux/proc_net_snmp.[ch] and extends the
> > network.{tcp,ip} namespaces in order to get a better glimpse at the
> > behaviour of the Linux network stack.
> > 
> > The rationale behind these new metrics is that these days, in order
> > to troubleshoot network performance issues you really need to look at
> > all of these metrics to understand why a certain workload is not
> > performing as expected. I will also submit two other separate PMDAs
> > (sctp and ethtool -S) later on.
> 
> Very cool, thanks!
> 
> > RFC since a) I just started reading up on PCP and b) I am unsure
> > if it is okay to add ~120 metrics in the Linux PMDA or if it is
> > preferable to put these in a separate one.
> > 
> > Any feedback is appreciated.
> 
> I'll review more closely once this current release is out the door,
> but the answer to b) is "its fine".  If/when we do reach a point down
> the track where we collectively decide there's too many metrics in a
> particular PMDA, there are approaches we've used before to split 'em
> up (while preserving the namespace, transitioning PMIDs safely, etc).
> 
> A few other things to think about at this stage:
> 
> - it'd be a good idea to consider whether all/some/none of these new
> metrics should be in the default-logged (by the system pmlogger) set
> of metrics (and what sort of sampling interval would suit);

Good point. I'll add some defaults, with rationale, in my next submission.

> - whether default or not, a src/pmlogconf/tools/netstat file (or some
> other similarly named file closeby there) would be a welcome addition
> to help people record these metrics;

Yup. Duly noted.

> - in terms of automated testing, the qa/957 test case automatically
> will use valgrind to check your new additions to the kernel PMDA, but
> we should add 2 new (similar) tests for the other two new PMDAs (I'll
> help you with that once they exist if you like);
> 
> - we should also check the actual values being exported match what
> we expect.  In some cases we've used other tools for verifying (see
> qa/635 for a networking example) but I have another, more generic
> approach planned for the kernel metrics (incl. these) which I hope
> to tackle next release - leave that with me & I'll get back to you
> on that one, I may need just some test case data from you there.

OK, I'll make sure to add some QA tests when submitting the non-RFC
version.

My current plan is more or less the following:
- /proc/net/netstat
Extend the existing Linux PMDA. This extends the network.{tcp,ip} namespaces.

- /proc/net/sctp/snmp and /proc/net/sctp/assocs
New, separate SCTP PMDA (separate because in 99% of cases it is of zero
interest). Creates a new network.sctp namespace containing generic SCTP
stack stats, plus indoms for per-association data.

- ethtool -S
Separate PMDA in Python (for now). Creates the new network.ethtool
namespace. This one seems to be the most complex in terms of NMS
modelling. More on this when I have something a bit more consumable.

- /proc/net/softnet_stat
This helps with scheduling issues. Will start by stuffing it in the Linux
PMDA. We can reconsider once we see the number of metrics.

- /proc/net/udp and /proc/net/udp6
UDP counters. Will start by stuffing them in the Linux
PMDA. We can reconsider once we see the number of metrics.

- /proc/net/tcp6
We currently have /proc/net/tcp and are missing the IPv6 version.

- /proc/net/snmp6
We currently have /proc/net/snmp and are missing the IPv6 counterpart.

- /proc/buddyinfo
It's good to collect fragmentation stats over time as some drivers have
been known to misbehave under the slightest memory pressure conditions.
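
To illustrate the first item, here is a minimal Python sketch of how the
paired header/value lines in /proc/net/netstat can be turned into
per-prefix counters. This is not the patch itself (that is C, modelled on
proc_net_snmp.[ch]); the function name parse_netstat and the sample
values are made up for illustration, and only a handful of fields are shown.

```python
def parse_netstat(text):
    """Parse /proc/net/netstat-style text into {prefix: {field: value}}.

    The file consists of line pairs: a header line listing field names,
    followed by a line with the matching values, both carrying the same
    "Prefix:" tag (e.g. TcpExt, IpExt).
    """
    stats = {}
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Walk the lines two at a time: (names, values).
    for header, values in zip(lines[0::2], lines[1::2]):
        h_prefix, _, h_fields = header.partition(":")
        v_prefix, _, v_fields = values.partition(":")
        if h_prefix != v_prefix:
            raise ValueError("prefix mismatch: %r vs %r" % (h_prefix, v_prefix))
        names = h_fields.split()
        nums = v_fields.split()
        if len(names) != len(nums):
            raise ValueError("field count mismatch for %s" % h_prefix)
        stats[h_prefix] = dict(zip(names, (int(n) for n in nums)))
    return stats


# Illustrative sample only: the values are invented and the field list
# is heavily truncated compared to a real /proc/net/netstat.
SAMPLE = """\
TcpExt: SyncookiesSent SyncookiesRecv TCPTimeouts
TcpExt: 0 0 17
IpExt: InNoRoutes InOctets OutOctets
IpExt: 2 123456 654321
"""

if __name__ == "__main__":
    parsed = parse_netstat(SAMPLE)
    for prefix, fields in parsed.items():
        for name, value in fields.items():
            print("%s.%s = %d" % (prefix, name, value))
```

On a real system the same function could be fed the contents of
/proc/net/netstat directly. Matching names to values by header rather
than by position keeps the parsing tolerant of kernels that add or
reorder fields.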

We'll then need to see what makes sense to collect by default, and at
what sampling intervals.
Anyway, thanks for the feedback. I'll publish a proper reviewable
git tree with my progress on this and post it here.

regards,
Michele
-- 
Michele Baldessari            <michele@xxxxxxxxxx>
C2A5 9DA3 9961 4FFB E01B  D0BC DDD4 DCCB 7515 5C6D
