Re: Patch 1300 & rpm issue with 1.3.0
- To: Steve Lord <lord@sgi.com>
- Subject: Re: Patch 1300 & rpm issue with 1.3.0
- From: "Foris, Jim (MED)" <foris@mr.mr.med.ge.com>
- Date: Fri, 29 Aug 2003 06:57:11 -0500
- Cc: Russell Cattelan <cattelan@xfs.org>, "Foris, Jim (MED)" <foris@mr.mr.med.ge.com>, Eric Sandeen <sandeen@sgi.com>, Kai Leibrandt <k_leibrandt@hotmail.com>, "'Simon Matter'" <simon.matter@ch.sauter-bc.com>, "'Axel Thimm'" <Axel.Thimm@physik.fu-berlin.de>, linux-xfs@oss.sgi.com
- In-reply-to: <1062115583.1695.25.camel@laptop.americas.sgi.com>
- References: <Pine.LNX.4.44.0308280914100.19961-100000@stout.americas.sgi.com> <3F4E5AD3.80101@med.ge.com> <1062111109.4318.6.camel@naboo> <1062115583.1695.25.camel@laptop.americas.sgi.com>
- Reply-to: "Foris, Jim (MED)" <james.foris@med.ge.com>
- Sender: linux-xfs-bounce@oss.sgi.com
- User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3) Gecko/20030314
Steve Lord wrote:
> On Thu, 2003-08-28 at 17:51, Russell Cattelan wrote:
>
>>On Thu, 2003-08-28 at 14:41, Foris, Jim (MED) wrote:
>>
>>>Eric Sandeen wrote:
>>>
>>>>On Thu, 28 Aug 2003, Kai Leibrandt wrote:
>>>>
>>>>
>>>>
>>>>>That's just what I was thinking; is rpm only an indication that other
>>>>>apps might have issues as well? If so, how do we identify them and
>>>>>rectify the problems? In the kernel, or in the app?
>>>>
>>>>
>>>>That's not clear to me yet, but we have done some O_DIRECT stress testing
>>>>and it's all been fine. So this doesn't seem to be a problem with
>>>>O_DIRECT in general, which makes me think it might be the app.
>>>>
>>>
>>>Using "strace" on a RH 2.4.20-20.9.XFS1.3.0 system to follow what "rpm" does
>>>during an install, the key difference seems to be the following sequence:
>>>
>>>WORKS (created an EXT3 partition, copied /var/lib/rpm/* to it, then mounted it at
>>>/var/lib/rpm)
>>
>>Is this ext2 or ext3?
>>ext2 will turn off O_DIRECT after the open call.
>>ext3 was supposed to; Eric has a new patch to fix that.
As stated above, it was EXT3 (I forgot to mention that Eric's patch had
been applied :-) ).
>
>
> This looks like memory alignment of the write buffer. The alignment of
> the memory may be constrained differently, possibly ext3 is not doing
> O_DIRECT so is not constraining I/O alignment. It would be good to see
> the address of the buffer passed into the write call.
Turns out that information is in my original posting:
4144 write(2, "write: 0xbffed120, 8192: Invalid"..., 41) = 41 <0.000012>
So the buffer address, 0xbffed120, is NOT correctly aligned.
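To make the arithmetic concrete, here is a minimal C sketch of the alignment
test. The helper names are hypothetical, and the 512-byte and 4096-byte
boundaries are assumptions (the boundary O_DIRECT actually requires depends on
the kernel and the file system):

```c
#include <stdint.h>

/* Hypothetical helper: nonzero when addr is a multiple of align.
 * align is assumed to be a power of two (e.g. 512 or 4096). */
static int is_aligned(uintptr_t addr, uintptr_t align)
{
    return (addr & (align - 1)) == 0;
}

/* Hypothetical helper: number of bytes addr lies past the previous
 * align-byte boundary (0 means the address is aligned). */
static uintptr_t misalignment(uintptr_t addr, uintptr_t align)
{
    return addr & (align - 1);
}
```

For the address in the strace output, misalignment(0xbffed120, 512) is 0x120
(288 bytes), and the remainder is the same against a 4096-byte boundary, so the
buffer fails either test.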
AND THE MYSTERY IS SOLVED: RPM fails because whoever added O_DIRECT access to
its internal database files did not check for, or guarantee, correct buffer
address alignment. The bug did not show up at Red Hat because they never tested
RPM on a file system that actually supports O_DIRECT (because they don't have
any).
The options to solve the problem become clear:
1. Build the kernel w/o O_DIRECT support (leave in patch 1300).
2. Build a kernel with Eric's patch and always have "/var/lib/rpm" reside on a
non-XFS partition.
3. Put LD_ASSUME_KERNEL into the environment when any "rpm" call is made.
4. Fix "rpm-4.2" (either by removing the ability to set O_DIRECT, or by adding
the necessary buffer boundary checks).
Personally, I think I will probably patch "rpm-4.2" since that is where the bug is.
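As a rough idea of what the "buffer boundary checks" in option 4 could look
like, here is a sketch in C. It is not taken from the rpm sources:
write_aligned() is a hypothetical helper, and the caller is assumed to pass an
alignment that suits the file system (a power of two that is also a multiple of
sizeof(void *), as posix_memalign() requires). It only addresses the buffer
address; O_DIRECT may also constrain the length and file offset, which this
sketch leaves to the caller.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper (not rpm code): write len bytes from buf to fd.
 * If buf is not aligned to 'align' bytes, the data is first copied into
 * an aligned bounce buffer so the kernel sees an aligned address.
 * Returns what write() returns, or -1 with errno set on allocation failure. */
static ssize_t write_aligned(int fd, const void *buf, size_t len, size_t align)
{
    void *bounce = NULL;
    ssize_t n;
    int rc, saved;

    /* Fast path: the caller's buffer is already aligned. */
    if (((uintptr_t)buf & (align - 1)) == 0)
        return write(fd, buf, len);

    /* Slow path: allocate an aligned buffer and copy into it.
     * posix_memalign() returns the error code instead of setting errno. */
    rc = posix_memalign(&bounce, align, len);
    if (rc != 0) {
        errno = rc;
        return -1;
    }
    memcpy(bounce, buf, len);

    n = write(fd, bounce, len);

    /* Preserve errno from write() across free(). */
    saved = errno;
    free(bounce);
    errno = saved;
    return n;
}
```

The extra copy costs something, so a real fix might instead allocate the
database buffers aligned in the first place, or simply not set O_DIRECT.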
Thanks to everyone,
Jim Foris
(And, by the way, there is no use of O_DIRECT in the db-4 code. This is a pure
RPM bug).
>
> Steve
>
>
>