scsi indom in linux PMDA - issues!

To: <pcp@xxxxxxxxxxx>
Subject: scsi indom in linux PMDA - issues!
From: "Ken McDonell" <kenj@xxxxxxxxxxxxxxxx>
Date: Wed, 21 Jan 2015 10:48:57 +1100
Delivered-to: pcp@xxxxxxxxxxx
Thread-index: AdA1BrDBlZa3Cf+2T7S6taCtjhP3Xw==
On the trail of qa/957 failures ... Mark's looking at the most common case,
but I found a different one that (at first) looks like a small mem leak.

The issue is around the malloc at line 75 in proc_scsi.c.

At first I thought it was a leak on one of the error paths ... and indeed
there are a couple.  But I fixed these and the problem did not go away.

Digging deeper, I noticed that none of the return codes from the
pmdaCacheFoo() calls in this file are checked, and one of the calls makes no
sense at all.  I culled the latter and added return status capture and
checks, and lo and behold the two leaks are associated with these cache
"add" calls ...

Warning: refresh_proc_scsi: pmdaCacheOp(60.11, ADD, "scsi0:0:0:0 CD-ROM", (sr0)): Invalid argument
Warning: refresh_proc_scsi: pmdaCacheOp(60.11, ADD, "scsi4:0:0:0 Direct-Access", (sdb)): Invalid argument

OK, so fix the 2 (!) leaks on this (new) error path, but why is it happening
and only on one QA host?
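For illustration only, here is a minimal sketch of the shape of that fix:
capture the status of the cache "add" and release everything allocated for
the entry when it fails.  This is not the code in proc_scsi.c.  The
scsi_entry_t record, toy_cache_add() (a stub standing in for the real
pmdaCache call), toy_fail and the live_allocs counter are all hypothetical
names made up for the sketch.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-device record; the real structure may differ. */
typedef struct {
    char *dev_name;             /* e.g. "sdb" */
} scsi_entry_t;

static int live_allocs;         /* crude leak counter, for this sketch only */
static int toy_fail;            /* set non-zero to make the stub fail */
static void *toy_owned;         /* last private pointer the stub accepted */

static void *xalloc(size_t n)
{
    void *p = malloc(n);
    if (p != NULL)
        live_allocs++;
    return p;
}

static void xfree(void *p)
{
    if (p != NULL) {
        live_allocs--;
        free(p);
    }
}

/*
 * Stub standing in for the cache "add" call (NOT the libpcp API).
 * Returns an instance id >= 0 on success and takes ownership of priv,
 * or returns a negative errno-style code and leaves priv with the caller.
 */
static int toy_cache_add(const char *name, void *priv)
{
    (void)name;
    if (toy_fail)
        return -EINVAL;
    toy_owned = priv;
    return 0;
}

/*
 * Add one device.  Returns the stub's status; on any failure both
 * allocations made here are released before returning.
 */
int add_scsi_entry(const char *inst_name, const char *dev_name)
{
    scsi_entry_t *entry;
    size_t len = strlen(dev_name) + 1;
    int sts;

    if ((entry = xalloc(sizeof(*entry))) == NULL)
        return -ENOMEM;
    if ((entry->dev_name = xalloc(len)) == NULL) {
        xfree(entry);
        return -ENOMEM;
    }
    memcpy(entry->dev_name, dev_name, len);

    sts = toy_cache_add(inst_name, entry);
    if (sts < 0) {
        fprintf(stderr, "Warning: add_scsi_entry: cache ADD \"%s\" (%s): %s\n",
                inst_name, dev_name, strerror(-sts));
        xfree(entry->dev_name);     /* leak 1 without this */
        xfree(entry);               /* leak 2 without this */
    }
    return sts;
}
```

With toy_fail set, add_scsi_entry() returns -EINVAL and live_allocs is
unchanged from before the call.  Drop either xfree() on the error path and
the counter creeps up on every failed add.
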

After some further digging, I discovered that the indom cache does not match
the current kernel state.

The failing state was this ...
kenj@bozo:/var/lib/pcp/config/pmda$ cat 60.11.old
1 0
0 1421057940 scsi0:0:0:0 Direct-Access
1 1421057940 scsi2:0:0:0 Direct-Access
2 1421057940 scsi4:0:0:0 CD-ROM
3 1418333286 scsi6:0:0:0 Direct-Access
4 1421057940 scsi7:0:0:0 Direct-Access

And if I blow it away and restart pmcd, the indom cache looks like this
...
root@bozo:/var/lib/pcp/config/pmda# cat 60.11
1 0
0 1421796486 scsi0:0:0:0 CD-ROM
1 1421796486 scsi2:0:0:0 Direct-Access
2 1421796486 scsi4:0:0:0 Direct-Access
3 1421796486 scsi6:0:0:0 Direct-Access

(Note that the names for scsi0:... and scsi4:... are swapped between the two
files, so the error from the pmdaCache routine is correct.)

And then the mem leak in the qa test went away.
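A rough way to spot this kind of mismatch is sketched below.  It assumes
(my reading of the behaviour above, not checked against the pmdaCache
implementation) that a clash means two names agree up to the first space
but differ after it.  names_conflict() and count_conflicts() are
hypothetical helpers, not part of libpcp.

```c
#include <stdio.h>
#include <string.h>

/* Length of the part of an instance name before the first space. */
static size_t key_len(const char *name)
{
    return strcspn(name, " ");
}

/*
 * Return 1 if the two names share the same leading key (the text up to
 * the first space) but are not identical overall, else 0.
 */
int names_conflict(const char *cached, const char *current)
{
    size_t n = key_len(cached);

    if (n != key_len(current) || strncmp(cached, current, n) != 0)
        return 0;               /* different keys, no clash */
    return strcmp(cached, current) != 0;
}

/*
 * Report and count every cached name that clashes with a current name.
 */
int count_conflicts(const char *cached[], int ncached,
                    const char *current[], int ncurrent)
{
    int i, j, n = 0;

    for (i = 0; i < ncached; i++) {
        for (j = 0; j < ncurrent; j++) {
            if (names_conflict(cached[i], current[j])) {
                printf("conflict: cached \"%s\" vs current \"%s\"\n",
                       cached[i], current[j]);
                n++;
            }
        }
    }
    return n;
}
```

Fed the five names from 60.11.old as cached[] and the four names from the
regenerated 60.11 as current[], this flags scsi0:0:0:0 and scsi4:0:0:0 and
nothing else.
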

So the _fundamental_ question is what are the semantics of this instance
domain?  Under what circumstances is it expected to change ... h/w reconfig
or non-determinism in the kernel's scsi scanning code?

If it is subject to expected change, then it should not be backed by
permanent store, but recreated each time as needed.

In the process of this investigation I quickly checked some other uses of
pmdaCacheFoo() in the linux PMDA and there are many issues ... this code
(and possibly the code in all PMDAs) needs a thorough audit by someone who
understands how the pmdaCache services really work.
