pcp
[Top] [All Lists]

heavy use of /tmp slowing down pmlogconf etc.

To: pcp developers <pcp@xxxxxxxxxxx>
Subject: heavy use of /tmp slowing down pmlogconf etc.
From: "Frank Ch. Eigler" <fche@xxxxxxxxxx>
Date: Mon, 16 May 2016 12:28:25 -0400
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <20160514193945.GC1418@xxxxxxxxxx>
References: <20160514193945.GC1418@xxxxxxxxxx>
User-agent: Mutt/1.4.2.2i
Hi -

While analyzing a larger server that oversees a pmmgr/pmlogger farm of
some hundreds of hosts, I just came across another uncomfortable
finding.  Even with the pmlogconf $PCP_DERIVED_CONFIG= override (as
present now on the fche/multithread branch commit #e56acaae4),
a pmlogconf job -still- can bog down to minutes of elapsed time.

This time, caught it in the act with strace (again).  It turns out
that the way the pmlogconf.sh shell script does a serious amount of
file-system traffic into /tmp, assembling little bits of pmlogger
configuration, we can suffer some amazing latencies.  For a time when
some dozens of pmlogconf's were running in parallel, creation of
new /tmp files took serious fractions of seconds!

An "strace -r -tt ..." rencered this section of the shell script:

    293             $PCP_AWK_PROG <"$config" >>$tmp/head '
    294 BEGIN                   { tag="'"$tag"'" }
    295 $1 == "#+" && $2 ~ tag  { want = 1 }
    296 want == 1               { print }
    297 want == 1 && /^#----/   { exit }'
    298             cat $tmp/head $tmp/tail >$tmp/ctl
    299             continue
    300         fi
    301 

with 

[pid 21480]      0.000063 open("/tmp/pcp.FzvfitYEA/ctl", 
O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
[pid 21480]      0.257356 dup2(3, 1)    = 1
[pid 21480]      0.000299 close(3)      = 0
[pid 21480]      0.000352 execve("/usr/bin/cat", ["cat", 
"/tmp/pcp.FzvfitYEA/head", "/tmp/pcp.FzvfitYEA/tail"], [/* 59 vars */]) = 0
[pid 21480]      0.000213 brk(0)        = 0x14b0000
[...]

Noticed that 0.257356 number?  That means that it took the kernel
257ms just to create a single file in /tmp.  (This was on a RHEL7-ish
kernel on a VM, with /tmp on ext4.)  It seems like there is such high
use of temp files & thus contention on kernel resources, that such
latencies can appear, over and over and over again.

What can we do to improve this?

- nothing (boo!)

- tweak pmlogconf to reduce /tmp usage (maybe use more shell variables
  instead of files?)

- rewrite pmlogconf in e.g. python, backward-compatible

- write a pmlogconf-new in e.g. python, not backward-compatible or
  feature-complete, maybe just for non-interactive use like from pmmgr


(See also prior concerns about pmprintf's use of /tmp, as well as the
compressed-archive implementation by decompression to /tmp.)


- FChE

<Prev in Thread] Current Thread [Next in Thread>