Re: [pcp] Performance of parsing an archive in python

To: Nathan Scott <nathans@xxxxxxxxxx>
Subject: Re: [pcp] Performance of parsing an archive in python
From: Michele Baldessari <michele@xxxxxxxxxx>
Date: Thu, 30 Oct 2014 08:53:48 +0100
Cc: pcp@xxxxxxxxxxx
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <670106284.3108531.1414623171877.JavaMail.zimbra@xxxxxxxxxx>
References: <20141029200642.GA19804@xxxxxxxxxxxxxxx> <670106284.3108531.1414623171877.JavaMail.zimbra@xxxxxxxxxx>
User-agent: Mutt/1.5.21 (2012-12-30)
Hi Nathan,

On Wed, Oct 29, 2014 at 06:52:51PM -0400, Nathan Scott wrote:
> ----- Original Message -----
> > [...]
> > real    19m31.860s
> > user    19m24.391s
> > sys     0m2.566s
> > """
> 
> OOC, can you time a pmlogsummary run on this archive?

Sure:
time pmlogsummary 20141029.00.10 &> /dev/null

real    0m8.651s
user    0m2.913s
sys     0m0.229s

So it's mostly a Python issue, caused by all the type conversions.

> > While 20 minutes to parse such a big archive might be relatively ok, I
> > was wondering what options I have to improve this. The ones I can
> > currently think of are:
> > 
> > 1) Split the time interval parsing over multiple CPUs. I can divide the
> > archive in subintervals (one per cpu) and have each CPU do its own
> > subinterval parsing and then stitch everything together at the end.
> > This is the approach I currently use to create the graph images that go
> > in the pdf (as matplotlib+reportlab aren't the fastest thing on the
> > planet)
> 
> Should definitely help, since it appears to be CPU bound currently.
> 
> > 2) Implement a function in the C python bindings which returns a
> > python dictionary as described above.  This would save me all the
> > ctypes/__init__ costs and probably I would shave some time off as there
> > would be less python->C function calls. Maybe we can find a generic
> > enough API for this to be usable by other clients?
> 
> Yep, sounds good.
> 
> > 3) See if I can use Cython tricks to speed up things
> > 
> > 4) Anything else I have not thought of?
> 
> pmlogsummary uses that raw archive fetching interface we talked about
> awhile back, which isn't always ideal for your needs - I'm interested
> in seeing the time difference though, if you could run that locally?

I assume you mean pmFetchArchive()? I did not notice any speed
improvement when using it from Python (plus it does not support
INTERP mode).

I'll give 2) a try and then we can see if it is generic enough for other
python users.
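
For reference, option 1) could look roughly like the sketch below. It is
only an illustration under stated assumptions: parse_subinterval() is a
hypothetical placeholder that fabricates one sample per second, where a
real worker would open its own archive context and fetch the values for
its subinterval. The names split_interval, stitch and parse_parallel are
made up for this example and are not part of the PCP bindings.

```python
import math
from multiprocessing import Pool, cpu_count


def split_interval(start, end, parts):
    """Divide [start, end) into `parts` contiguous, non-overlapping subintervals."""
    step = (end - start) / parts
    bounds = [start + i * step for i in range(parts)] + [end]
    return list(zip(bounds[:-1], bounds[1:]))


def parse_subinterval(bounds):
    """Hypothetical worker (assumption, not PCP code).

    A real implementation would open the archive in this process and
    fetch every sample in [lo, hi). Here we fabricate one sample per
    whole second so the sketch is self-contained.
    """
    lo, hi = bounds
    return [(t, {"example.metric": t * 2})
            for t in range(math.ceil(lo), math.ceil(hi))]


def stitch(chunks):
    """Concatenate per-worker results, which arrive in subinterval order."""
    merged = []
    for chunk in chunks:
        merged.extend(chunk)
    return merged


def parse_parallel(start, end, workers=None):
    """Parse [start, end) with one subinterval per worker process."""
    workers = workers or cpu_count()
    with Pool(workers) as pool:
        # Pool.map preserves input order, so no sort is needed afterwards.
        return stitch(pool.map(parse_subinterval,
                               split_interval(start, end, workers)))
```

Calling parse_parallel(0, 600, workers=4) from under an
`if __name__ == "__main__":` guard would give each of the four processes
a 150 second slice. The subintervals are half-open so that a sample
sitting exactly on a boundary is parsed by one worker only.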

Thanks,
Michele
-- 
Michele Baldessari            <michele@xxxxxxxxxx>
C2A5 9DA3 9961 4FFB E01B  D0BC DDD4 DCCB 7515 5C6D
