From owner-linux-origin@oss.sgi.com Mon Oct 2 04:23:28 2000
Received: by oss.sgi.com id ; Mon, 2 Oct 2000 04:23:18 -0700
Received: from delta.ds2.pg.gda.pl ([153.19.144.1]:25047 "EHLO delta.ds2.pg.gda.pl")
        by oss.sgi.com with ESMTP id ; Mon, 2 Oct 2000 04:22:55 -0700
Received: from localhost by delta.ds2.pg.gda.pl (8.9.3/8.9.3) with SMTP id NAA08294;
        Mon, 2 Oct 2000 13:20:20 +0200 (MET DST)
Date: Mon, 2 Oct 2000 13:20:19 +0200 (MET DST)
From: "Maciej W. Rozycki"
To: Ralf Baechle
cc: linux-origin@oss.sgi.com, Ulf Carlsson, Keith M Wesolowski
Subject: Re: ld.so bug
In-Reply-To: <20000912193424.A4052@bacchus.dhis.org>
Message-ID:
Organization: Technical University of Gdansk
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-origin@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;linux-origin-outgoing

On Tue, 12 Sep 2000, Ralf Baechle wrote:

> Kanoj, got an idea why the kernel might load ld.so to a different address
> than it is linked for?

 Because the suggested address space is already occupied?

> Note that ldd is running ld.so directly, therefore on an Origin it will
> always say ``not a dynamic executable''.  This again will confuse
> libtool into producing wrong library and rpm into generating packages
> without library dependency information and probably a few more
> neat little resulting bugs.

 For rpm see how I handle the require/provide lists in my RPM packages --
I try not to use ldd at all, as this does not work for cross-compilation.
Readelf and objdump are much better tools to fetch such dependencies and
have the advantage of not including indirect ones (which may vary between
library releases).

> Btw, this same problem should also affect glibc 2.2.

 It does.  No idea why at the moment, but it can be easily reproduced e.g.
by `/lib/ld.so.1 /bin/rpm -ya' (this way of invoking makes the ld.so
preferred space be occupied by the direct invocation, so the second copy
of ld.so that gets loaded by dlopening NSS modules gets mmapped at a
non-standard address).

  Maciej

-- 
+  Maciej W. Rozycki, Technical University of Gdansk, Poland   +
+--------------------------------------------------------------+
+        e-mail: macro@ds2.pg.gda.pl, PGP key available        +

From owner-linux-origin@oss.sgi.com Mon Oct 2 13:19:22 2000
Received: by oss.sgi.com id ; Mon, 2 Oct 2000 13:19:12 -0700
Received: from u-211.karlsruhe.ipdial.viaginterkom.de ([62.180.21.211]:36366
        "EHLO u-211.karlsruhe.ipdial.viaginterkom.de")
        by oss.sgi.com with ESMTP id ; Mon, 2 Oct 2000 13:18:37 -0700
Received: (ralf@lappi) by lappi.waldorf-gmbh.de id ; Mon, 2 Oct 2000 10:33:34 +0200
Date: Mon, 2 Oct 2000 10:33:34 +0200
From: Ralf Baechle
To: Kanoj Sarcar, linux-origin@oss.sgi.com
Subject: Crashes
Message-ID: <20001002103334.A29695@bacchus.dhis.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 1.0.1i
X-Accept-Language: de,en,fr
Sender: owner-linux-origin@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;linux-origin-outgoing

I've had a crash while doing cd /; find . -xdev -print0 | cpio -oumd0 /mnt.
That copy went fine for a while, then I did cd /mnt/tmp; rm -rf * and
shortly after this the machine crashed.  The epc is pointing to 800d430c:

[...]
800d42f8:       90a20000        lbu     $v0,0($a1)
800d42fc:       14540005        bne     $v0,$s4,800d4314
800d4300:       3c020007        lui     $v0,0x7
800d4304:       0c0350fb        jal     800d43ec
800d4308:       00a0202d        move    $a0,$a1
800d430c:       10000002        b       800d4318
800d4310:       ae020228        sw      $v0,552($s0)
800d4314:       ae020228        sw      $v0,552($s0)
800d4318:       960200d6        lhu     $v0,214($s0)
800d431c:       5040000a        beqzl   $v0,800d4348
[...]

this is a branch, so the fault was caused by the following instruction, which
was dereferencing a NULL pointer.
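
As a rough illustration of the reasoning above (not taken from the kernel
sources): on MIPS, when an exception is raised by an instruction in a branch
delay slot, EPC holds the address of the branch and the BD bit (bit 31) of
the Cause register is set, so the faulting instruction sits at EPC + 4.  The
Python sketch below applies that rule to the epc quoted above and decodes the
store word found in the delay slot.  The function names, the 32-bit address
width and the example Cause value are assumptions made for illustration; the
Cause value from the actual panic is not shown in this thread.

```python
CAUSE_BD = 1 << 31  # BD bit of the MIPS Cause register: fault was in a branch delay slot


def faulting_pc(epc, cause):
    """Address of the instruction that actually faulted (32-bit addresses assumed)."""
    if cause & CAUSE_BD:
        return (epc + 4) & 0xFFFFFFFF
    return epc


def decode_itype(word):
    """Split a MIPS I-type instruction word into (opcode, rs, rt, signed immediate)."""
    opcode = (word >> 26) & 0x3F
    rs = (word >> 21) & 0x1F
    rt = (word >> 16) & 0x1F
    imm = word & 0xFFFF
    if imm & 0x8000:          # sign-extend the 16-bit offset
        imm -= 0x10000
    return opcode, rs, rt, imm


def store_address(base, word):
    """Effective address of an I-type load/store for a given base register value."""
    _, _, _, imm = decode_itype(word)
    return (base + imm) & 0xFFFFFFFF


if __name__ == "__main__":
    epc = 0x800D430C
    cause = CAUSE_BD                          # hypothetical: only the BD bit set
    print(hex(faulting_pc(epc, cause)))       # 0x800d4310
    print(decode_itype(0xAE020228))           # (43, 16, 2, 552): sw $v0,552($s0)
    print(hex(store_address(0, 0xAE020228)))  # 0x228
```

If the base register $s0 were NULL, the store would target address 0x228,
which is what one would expect to see as the bad address for this kind of
fault.  If the BD bit were clear, the branch at EPC itself would be the
instruction to look at instead.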
What makes me more worried about this kind
of crash is I also keep receiving reports about I/O errors and data corruption
from users of the 32-bit kernel while copying from one disk to another
physical disk, so exactly what also happened here.

Ideas?

  Ralf

From owner-linux-origin@oss.sgi.com Mon Oct 2 14:08:03 2000
Received: by oss.sgi.com id ; Mon, 2 Oct 2000 14:07:53 -0700
Received: from deliverator.sgi.com ([204.94.214.10]:28997 "EHLO deliverator.sgi.com")
        by oss.sgi.com with ESMTP id ; Mon, 2 Oct 2000 14:07:31 -0700
Received: from google.engr.sgi.com (google.engr.sgi.com [163.154.10.145])
        by deliverator.sgi.com (980309.SGI.8.8.8-aspam-6.2/980310.SGI-aspam)
        via ESMTP id NAA04519; Mon, 2 Oct 2000 13:59:07 -0700 (PDT)
        mail_from (kanoj@google.engr.sgi.com)
Received: (from kanoj@localhost)
        by google.engr.sgi.com (SGI-8.9.3/8.9.3) id OAA09593;
        Mon, 2 Oct 2000 14:05:35 -0700 (PDT)
From: Kanoj Sarcar
Message-Id: <200010022105.OAA09593@google.engr.sgi.com>
Subject: Re: Crashes
To: ralf@oss.sgi.com (Ralf Baechle)
Date: Mon, 2 Oct 2000 14:05:35 -0700 (PDT)
Cc: linux-origin@oss.sgi.com
In-Reply-To: <20001002103334.A29695@bacchus.dhis.org> from "Ralf Baechle" at Oct 02, 2000 10:33:34 AM
X-Mailer: ELM [version 2.5 PL2]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-origin@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;linux-origin-outgoing

>
> I've had a crash while doing cd /; find . -xdev -print0 | cpio -oumd0 /mnt.
> That copy went fine for a while, then I did cd /mnt/tmp; rm -rf * and
> shortly after this the machine crashed.  The epc is pointing to 800d430c:
>
> [...]
> 800d42f8:       90a20000        lbu     $v0,0($a1)
> 800d42fc:       14540005        bne     $v0,$s4,800d4314
> 800d4300:       3c020007        lui     $v0,0x7
> 800d4304:       0c0350fb        jal     800d43ec
> 800d4308:       00a0202d        move    $a0,$a1
> 800d430c:       10000002        b       800d4318
> 800d4310:       ae020228        sw      $v0,552($s0)
> 800d4314:       ae020228        sw      $v0,552($s0)
> 800d4318:       960200d6        lhu     $v0,214($s0)
> 800d431c:       5040000a        beqzl   $v0,800d4348
> [...]
>
> this is a branch, so the fault was caused by the following instruction which

I hope you verified the BD bit was set in the Cause register printed out
as part of the panic ...

> was dereferencing a NULL pointer.  What makes me more worried about this kind
> of crash is I also keep receiving reports about I/O errors and data corruption

This looks like the isp1020_intr_handler() code to me, I would be surprised
if the 32-bit guys are using the same driver. In any case, it might be
worthwhile to match up with C code and see which variable/pointer was
NULL, that might give us a clue.

AFAICS, the above asm code probably corresponds to this in
isp1020_intr_handler:

                if (sts->hdr.entry_type == ENTRY_STATUS)
                        Cmnd->result = isp1020_return_status(sts);
                else
                        Cmnd->result = DID_ERROR << 16;

It seems to me that Cmnd turned out to be 0, and I see that

                Cmnd = hostdata->cmd_slots[cmd_slot];

and this in fact could be the first place that uses Cmnd.

Kanoj

> from users of the 32-bit kernel while copying from one disk to another
> physical disk, so exactly what also happened here.
>
> Ideas?
>
>   Ralf
>

From owner-linux-origin@oss.sgi.com Tue Oct 3 09:28:20 2000
Received: by oss.sgi.com id ; Tue, 3 Oct 2000 09:28:10 -0700
Received: from u-206.karlsruhe.ipdial.viaginterkom.de ([62.180.19.206]:14609
        "EHLO u-206.karlsruhe.ipdial.viaginterkom.de")
        by oss.sgi.com with ESMTP id ; Tue, 3 Oct 2000 09:27:30 -0700
Received: (ralf@lappi) by lappi.waldorf-gmbh.de id ; Tue, 3 Oct 2000 18:26:28 +0200
Date: Tue, 3 Oct 2000 18:26:28 +0200
From: Ralf Baechle
To: Kanoj Sarcar
Cc: linux-origin@oss.sgi.com
Subject: Re: Crashes
Message-ID: <20001003182628.B25215@bacchus.dhis.org>
References: <20001002103334.A29695@bacchus.dhis.org> <200010022105.OAA09593@google.engr.sgi.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 1.0.1i
In-Reply-To: <200010022105.OAA09593@google.engr.sgi.com>; from kanoj@google.engr.sgi.com on Mon, Oct 02, 2000 at 02:05:35PM -0700
X-Accept-Language: de,en,fr
Sender: owner-linux-origin@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;linux-origin-outgoing

On Mon, Oct 02, 2000 at 02:05:35PM -0700, Kanoj Sarcar wrote:

> This looks like the isp1020_intr_handler() code to me, I would be surprised
> if the 32-bit guys are using the same driver. In any case, it might be
> worthwhile to match up with C code and see which variable/pointer was
> NULL, that might give us a clue.
>
> AFAICS, the above asm code probably corresponds to this in
> isp1020_intr_handler:
>
>                 if (sts->hdr.entry_type == ENTRY_STATUS)
>                         Cmnd->result = isp1020_return_status(sts);
>                 else
>                         Cmnd->result = DID_ERROR << 16;
>
> It seems to me that Cmnd turned out to be 0, and I see that
>
>                 Cmnd = hostdata->cmd_slots[cmd_slot];
>
> and this in fact could be the first place that uses Cmnd.

It looks like the changes to bitops.h from two days ago did fix the problem
the other people were reporting.  So the driver issue we have must be
something independent.
  Ralf

From owner-linux-origin@oss.sgi.com Thu Oct 5 05:26:28 2000
Received: by oss.sgi.com id ; Thu, 5 Oct 2000 05:26:18 -0700
Received: from u-143.karlsruhe.ipdial.viaginterkom.de ([62.180.18.143]:53252
        "EHLO u-143.karlsruhe.ipdial.viaginterkom.de")
        by oss.sgi.com with ESMTP id ; Thu, 5 Oct 2000 05:25:52 -0700
Received: (ralf@lappi) by lappi.waldorf-gmbh.de id ; Thu, 5 Oct 2000 14:24:52 +0200
Date: Thu, 5 Oct 2000 14:24:52 +0200
From: Ralf Baechle
To: linux-origin@oss.sgi.com
Subject: Shared pages
Message-ID: <20001005142452.A31160@bacchus.dhis.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 1.0.1i
X-Accept-Language: de,en,fr
Sender: owner-linux-origin@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;linux-origin-outgoing

Something is odd with our mm in test8-pre1.  According to /proc/meminfo
we don't have any shared pages.  Haven't yet tried this in test9.  Ideas
what might be causing this?

  Ralf

From owner-linux-origin@oss.sgi.com Thu Oct 5 08:17:39 2000
Received: by oss.sgi.com id ; Thu, 5 Oct 2000 08:17:29 -0700
Received: from Cantor.suse.de ([194.112.123.193]:50955 "HELO Cantor.suse.de")
        by oss.sgi.com with SMTP id ; Thu, 5 Oct 2000 08:16:59 -0700
Received: from Hermes.suse.de (Hermes.suse.de [194.112.123.136])
        by Cantor.suse.de (Postfix) with ESMTP id E01A61E114;
        Thu, 5 Oct 2000 17:16:15 +0200 (MEST)
Received: from gruyere.muc.suse.de (unknown [10.23.1.2])
        by Hermes.suse.de (Postfix) with ESMTP id 52A693E44F;
        Thu, 5 Oct 2000 17:16:15 +0200 (MEST)
Received: by gruyere.muc.suse.de (Postfix, from userid 14446)
        id 55D722F300; Thu, 5 Oct 2000 17:16:13 +0200 (MEST)
Date: Thu, 5 Oct 2000 17:16:13 +0200
From: "Andi Kleen"
To: Ralf Baechle
Cc: linux-origin@oss.sgi.com
Subject: Re: Shared pages
Message-ID: <20001005171613.A13082@gruyere.muc.suse.de>
References: <20001005142452.A31160@bacchus.dhis.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 1.0i
In-Reply-To:
 <20001005142452.A31160@bacchus.dhis.org>; from ralf@oss.sgi.com on Thu, Oct 05, 2000 at 02:24:52PM +0200
Sender: owner-linux-origin@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;linux-origin-outgoing

Hi Ralf,

On Thu, Oct 05, 2000 at 02:24:52PM +0200, Ralf Baechle wrote:
> Something is odd with our mm in test8-pre1.  According to /proc/meminfo
> we don't have any shared pages.  Haven't yet tried this in test9.  Ideas
> what might be causing this?

Shared page info was removed in the latest 2.4, because it requires long
table walks which cause scheduling anomalies when you have several GB of RAM.
When done once per second by a system monitor on a really big system, the
mouse pointer often hangs.

-Andi
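
To give a feel for the cost described above, here is a toy Python sketch.
It is not the kernel's code: it simply models counting shared pages as a
full scan over one reference count per page frame, so the work grows
linearly with the amount of RAM.  The 4 KB page size, the function names
and the sample reference counts are assumptions made for illustration.

```python
def page_frames(ram_bytes, page_size=4096):
    """Number of page frames for a given amount of RAM (page size is an assumption)."""
    return ram_bytes // page_size


def count_shared(refcounts):
    """Count frames referenced more than once; every frame has to be visited."""
    return sum(1 for count in refcounts if count > 1)


if __name__ == "__main__":
    gib = 1 << 30
    for ram in (1 * gib, 4 * gib, 16 * gib):
        # 262144, 1048576 and 4194304 frames respectively
        print(ram // gib, "GiB ->", page_frames(ram), "frames to scan")
    print(count_shared([0, 1, 2, 3, 1]))  # 2
```

Under these assumptions a 4 GiB machine has about a million frames to
visit on every query, which is the kind of walk that gets expensive when
a monitor repeats it once per second.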