Bug 158 - FAM does not recover from queue overflows
Status: NEW
Product: fam
Classification: Unclassified
Component: fam
Version: unspecified
Hardware: All Linux
Importance: P1 normal
Assigned To: Michael Wardle
http://oss.sgi.com/projects/fam/archi...
Depends on:
Blocks: 91
Reported: 2002-07-04 14:36 CDT by Felix Kurth
Modified: 2007-05-23 07:16 CDT (History)
Attachments
block SIGIO and SIGRTMIN so we handle each one fully (549 bytes, patch)
2002-07-23 22:42 CDT, Michael Wardle

Description Felix Kurth 2002-07-04 14:36:31 CDT
It often happens that fam uses 100% CPU until it is restarted, even when
no processes are using fam. Only restarting helps.
Is there a way to track it down?
When I use the -d option it doesn't seem to happen...
My system:
Linux 2.4.19-pre10 i686 AuthenticAMD
fam-2.6.7 with dnotify patch. Some other Gentoo users are also
reporting this problem.
See:   
http://forums.gentoo.org/viewtopic.php?t=6761  
and
http://forums.gentoo.org/viewtopic.php?t=5708

I will help you track down the problem, but I need assistance.

felix
Comment 1 Michael Wardle 2002-07-04 21:20:42 CDT
Hi Felix, and thanks for your interest.

I don't think FAM is enabled by default on Gentoo; indeed, it seems that the
user must install FAM manually.  Is FAM running from xinetd or some other way?

If the problem does not appear when using the -d flag, what behavior do you
get if you use the -f flag?

What clients are using FAM (GNOME, KDE, something else)?

Some users have reported problems with the DNotify patch.  Are you able to
rebuild FAM without the DNotify patch and report whether this problem still
occurs?
Comment 2 Felix Kurth 2002-07-05 07:06:14 CDT
I'm now running with the -d option.
What is this?
It is printed millions of times on the console,
but it stops when the last client is closed (I'm using Konqueror/kde-3.01).

felix

*************** overflow sigqueue ***********************
*************** overflow sigqueue ***********************
*************** overflow sigqueue ***********************
*************** overflow sigqueue ***********************
*************** overflow sigqueue ***********************
*************** overflow sigqueue ***********************
*************** overflow sigqueue ***********************
*************** overflow sigqueue ***********************
*************** overflow sigqueue ***********************
*************** overflow sigqueue ***********************
*************** overflow sigqueue ***********************
*************** overflow sigqueue ***********************
*************** overflow sigqueue ***********************
*************** overflow sigqueue ***********************

Comment 3 Michael Wardle 2002-07-07 15:50:56 CDT
The only place I see this text is in the DNotify patch.
Comment 4 Michael Wardle 2002-07-07 15:55:48 CDT
It seems the DNotify patch uses a queue of 1024 elements (as does FAM
currently), so if Konqueror is trying to monitor more than 1024 files
this may explain the problem.
Comment 5 Michael Wardle 2002-07-11 22:44:20 CDT
The bit in the DNotify code that seems to describe why this might occur reads:
"When the RT queue overflows we get a SIGIO".
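
A minimal sketch in C of the overflow condition discussed in Comments 4 and 5,
and of one way a daemon could recover from it. This is not code from FAM or
from the DNotify patch: the names change_queue, queue_push, queue_pop and
queue_recover are hypothetical, and only the 1024-entry capacity is taken from
the discussion above. The idea is that a full queue latches an overflow flag
rather than dropping events silently, and that recovery throws away the
incomplete backlog and tells the caller to rescan everything it monitors.
Signal safety is deliberately ignored to keep the model short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAPACITY 1024 /* queue size mentioned in Comment 4 */

/* Hypothetical fixed-size queue of file descriptors that reported a change. */
struct change_queue {
    int fds[QUEUE_CAPACITY];
    size_t head;      /* index of the oldest entry */
    size_t count;     /* number of entries currently stored */
    bool overflowed;  /* latched when an event could not be stored */
};

void queue_init(struct change_queue *q)
{
    assert(q != NULL);
    q->head = 0;
    q->count = 0;
    q->overflowed = false;
}

/* Stores fd; on a full queue, latches the overflow flag and returns false. */
bool queue_push(struct change_queue *q, int fd)
{
    assert(q != NULL);
    if (q->count == QUEUE_CAPACITY) {
        q->overflowed = true;
        return false;
    }
    q->fds[(q->head + q->count) % QUEUE_CAPACITY] = fd;
    q->count++;
    return true;
}

/* Removes the oldest entry into *fd; returns false if the queue is empty. */
bool queue_pop(struct change_queue *q, int *fd)
{
    assert(q != NULL && fd != NULL);
    if (q->count == 0)
        return false;
    *fd = q->fds[q->head];
    q->head = (q->head + 1) % QUEUE_CAPACITY;
    q->count--;
    return true;
}

/*
 * Recovery step: if an overflow was latched, the stored entries are an
 * incomplete record, so drop them, clear the flag, and return true to tell
 * the caller to rescan every monitored directory.
 */
bool queue_recover(struct change_queue *q)
{
    assert(q != NULL);
    if (!q->overflowed)
        return false;
    q->head = 0;
    q->count = 0;
    q->overflowed = false;
    return true;
}
```

A caller would check queue_recover() before draining the queue; when it
returns true, the individual entries can no longer be trusted and a full
rescan is the only safe way to catch up.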
Comment 6 Michael Wardle 2002-07-23 00:14:23 CDT
This issue relates to a message posted to fam@oss.sgi.com some time back.

I don't think my earlier comments in this bug are pertinent.

My guess is that the (fam|sig)queue is filled too quickly for FAM to recover,
but I could be wrong.

Is there any possibility of you looking into this, Alex?

FWIW: When I use the FAM test program to FAM /dev on my Red Hat 7.3 box I
actually get a famqueue overflow rather than a sigqueue overflow.
Comment 7 Alexander Larsson 2002-07-23 00:41:51 CDT
It should be able to recover from overflows, so it should be fixed. I'm going on
vacation tomorrow, so I don't have time to look at it right now. But eventually
I'd like to fix it.
Comment 8 Michael Wardle 2002-07-23 18:37:05 CDT
I notice a couple of interesting things:
1. if I redirect output to a regular file (rather than my xterm), I can
   run test/test -d /dev successfully (in both instances I use fam -d)
2. if I put a couple of printfs in the signal handler, I now get a sigqueue
   overflow rather than a famqueue overflow
Comment 9 Michael Wardle 2002-07-23 22:42:30 CDT
Created attachment 39 [details]
block SIGIO and SIGRTMIN so we handle each one fully
Comment 10 Michael Wardle 2002-07-23 22:47:41 CDT
As it seems that the events are being created faster than FAM/DNotify is
handling them (and it could be that we are getting signals before the
entire signal handling function can be performed), one way around this could
be to block the SIGIO (RT queue overflowed) and SIGRTMIN (file changed)
signals, so we handle each signal one at a time.

I'm not familiar with real-time signals, so please regard this as an idea
rather than a definite solution.  Whether both or only one signal type should
be blocked should be considered.  Doing things this way may also be far less
efficient, but it does seem to stop the problem. :-)
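
One way to express the idea above in C is to list both signals in the sa_mask
of each handler, so that neither SIGIO nor SIGRTMIN can interrupt a handler
that is already running. This is a sketch only and does not reproduce
attachment 39, whose contents are not shown in this report; the function and
counter names (install_blocking_handlers, handler_blocks, on_notify_signal,
change_signals, overflow_signals) are hypothetical. Only sigaction() and the
sigset functions from POSIX are used.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Counters updated from the handler; sig_atomic_t keeps the accesses safe. */
volatile sig_atomic_t change_signals = 0;
volatile sig_atomic_t overflow_signals = 0;

/* Shared handler: SIGIO stands for "RT queue overflowed", anything else
 * (here SIGRTMIN) for "file changed", as described in Comment 10. */
void on_notify_signal(int signo, siginfo_t *info, void *context)
{
    (void)info;
    (void)context;
    if (signo == SIGIO)
        overflow_signals++;
    else
        change_signals++;
}

/* Installs the handler for both signals with both signals in sa_mask, so
 * each delivery is handled fully before the next one starts.
 * Returns 0 on success, -1 on failure. */
int install_blocking_handlers(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_notify_signal;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    sigaddset(&sa.sa_mask, SIGIO);
    sigaddset(&sa.sa_mask, SIGRTMIN);

    if (sigaction(SIGRTMIN, &sa, NULL) != 0)
        return -1;
    if (sigaction(SIGIO, &sa, NULL) != 0)
        return -1;
    return 0;
}

/* Returns 1 if `blocked` is in the handler mask installed for `signo`,
 * 0 if it is not, and -1 on error. */
int handler_blocks(int signo, int blocked)
{
    struct sigaction current;

    assert(signo > 0 && blocked > 0);
    if (sigaction(signo, NULL, &current) != 0)
        return -1;
    return sigismember(&current.sa_mask, blocked);
}
```

Blocked real-time signals stay queued and are delivered once the handler
returns, so this serializes handling but does not enlarge the queue; a burst
of events can still overflow it, which is consistent with the caveat above
that this is an idea rather than a definite solution.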
Comment 11 Michael Wardle 2003-01-13 14:43:18 CST
Wil Evers suggests building with -lrt -lpthread here:
<http://oss.sgi.com/projects/fam/mail_archive/200301/msg00011.html>
Comment 12 Michael Wardle 2003-01-22 15:36:04 CST
*** Bug 210 has been marked as a duplicate of this bug. ***
Comment 13 John Arthorne 2004-02-23 14:26:52 CST
I am trying to use FAM to implement an auto-refresh feature in the IDE tooling
of the eclipse project (www.eclipse.org).  Essentially, we want any resources in
the user's workspace to be automatically refreshed if changes are made using
some external tool.  I believe I am running into this apparent 1024 limit on the
number of directories that can be monitored. User workspaces in Eclipse can
easily contain thousands of directories. The behavior I see is that the FAM API
calls (FAMMonitorDirectory, FAMCancelMonitor) seem to hang once this limit is
exceeded. I take it this is a known limitation?  Any plans to remove this
limitation?
Comment 14 sudeer 2004-12-01 07:51:13 CST
Comment on attachment 39 [details]
block SIGIO and SIGRTMIN so we handle each one fully

AAAA
Comment 15 Ilya Konstantinov 2005-01-20 02:29:47 CST
I would be surprised if RedHat didn't already solve this.
RedHat currently ships Fedora with FAM 2.6.10. Could someone check whether
RedHat's current dnotify patch has any improvements over the one on SGI's site,
and if it has, merge those improvements into SGI's dnotify patch?
Comment 16 David 2007-05-23 05:16:31 CDT
Is this ~5yr old bug going to get fixed? :-X  I regularly see this problem and
have to restart fam due to the 99% cpu use it causes.