
Re: Patch 1300 & rpm issue with 1.3.0



Steve Lord wrote:
> On Thu, 2003-08-28 at 17:51, Russell Cattelan wrote:
> 
>>On Thu, 2003-08-28 at 14:41, Foris, Jim (MED) wrote:
>>
>>>Eric Sandeen wrote:
>>>
>>>>On Thu, 28 Aug 2003, Kai Leibrandt wrote:
>>>>
>>>>
>>>>
>>>>>That's just what I was thinking; is rpm only an indication that other
>>>>>apps might have issues as well? If so, how do we identify them and
>>>>>rectify the problems? In the kernel, or in the app?
>>>>
>>>>
>>>>That's not clear to me yet, but we have done some O_DIRECT stress testing
>>>>and it's all been fine.  So this doesn't seem to be a problem with
>>>>O_DIRECT in general, which makes me think it might be the app.
>>>>
>>>
>>>Using "strace" on a RH 2.4.20-20.9.XFS1.3.0 system to follow what "rpm" does
>>>during an install, the key difference seems to be the following sequence:
>>>
>>>WORKS (created an EXT3 partition, copied /var/lib/rpm/* to it, then mounted it at
>>>/var/lib/rpm)
>>
>>Is this ext2 or ext3?
>>ext2 will turn off O_DIRECT after the open call
>>ext3 was supposed to; Eric has a new patch to fix that.

As stated above, it was EXT3 (I forgot to mention that Eric's patch had
been applied :-) ).
> 
> 
> This looks like memory alignment of the write buffer. The alignment of
> the memory may be constrained differently, possibly ext3 is not doing
> O_DIRECT so is not constraining I/O alignment. It would be good to see
> the address of the buffer passed into the write call. 

Turns out that information is in my original posting:

     4144  write(2, "write: 0xbffed120, 8192: Invalid"..., 41) = 41 <0.000012>

So the buffer address, 0xbffed120, is NOT correctly aligned (its low bits
are 0x120, so it is not even on a 512-byte boundary).
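
For anyone who wants to check an address taken from an strace log, the
arithmetic can be sketched in C as below. The helper names (misalignment,
is_aligned) and the 512-byte and 4096-byte boundaries are assumptions made for
illustration; the boundary a given filesystem actually requires for O_DIRECT
may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of bytes by which addr overshoots the previous multiple of
 * alignment. alignment is assumed to be a power of two (e.g. 512, 4096). */
static size_t misalignment(uintptr_t addr, size_t alignment)
{
    return (size_t)(addr & (uintptr_t)(alignment - 1));
}

/* Non-zero when addr sits exactly on an alignment boundary. */
static int is_aligned(uintptr_t addr, size_t alignment)
{
    return misalignment(addr, alignment) == 0;
}
```

For the address in the trace, misalignment((uintptr_t)0xbffed120, 512) comes
out as 0x120 (288 bytes), and the result is the same against 4096, so the
buffer fails both checks.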


AND THE MYSTERY IS SOLVED; RPM fails because the person who tried to use
O_DIRECT file access to an internal database file did not check for/guarantee
correct buffer address alignment.  This bug did not show up to Red Hat because
they never tested it (RPM) on a file system that actually supports O_DIRECT
(because they don't have any).

The options to solve the problem become clear:

1. Build the kernel w/o O_DIRECT support (leave in patch 1300).
2. Build a kernel with Eric's patch and always have "/var/lib/rpm" reside on a
     non-XFS partition.
3. Put LD_ASSUME_KERNEL into the environment when any "rpm" call is made.
4. Fix "rpm-4.2" (either by removing the ability to set O_DIRECT, or by adding
     the necessary buffer boundary checks).

Personally, I think I will probably patch "rpm-4.2" since that is where the bug is.
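
As a rough idea of what the boundary check in option 4 could look like, here
is a minimal sketch in C. It is not the rpm-4.2 code and not an actual patch:
the function name write_aligned and its alignment parameter are hypothetical.
It only handles the buffer address, on the assumption that the caller already
keeps the length and file offset suitably aligned, which O_DIRECT usually
requires as well.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* for posix_memalign */
#endif
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes from buf to fd, going through an
 * aligned bounce buffer whenever buf itself is not on an alignment boundary.
 * alignment must be a power of two and a multiple of sizeof(void *). */
static ssize_t write_aligned(int fd, const void *buf, size_t len,
                             size_t alignment)
{
    void *bounce = NULL;
    ssize_t n;
    int saved_errno;

    /* Already aligned (or nothing to copy): write straight through. */
    if (len == 0 || ((uintptr_t)buf % alignment) == 0)
        return write(fd, buf, len);

    /* posix_memalign returns an error number instead of setting errno. */
    if (posix_memalign(&bounce, alignment, len) != 0) {
        errno = ENOMEM;
        return -1;
    }

    memcpy(bounce, buf, len);
    n = write(fd, bounce, len);

    /* Preserve the errno from write() across free(). */
    saved_errno = errno;
    free(bounce);
    errno = saved_errno;
    return n;
}
```

The copy costs an extra allocation per unaligned write, so allocating the
buffer aligned in the first place would be cheaper; the bounce buffer is just
the smallest change to sketch.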

Thanks to everyone,

Jim Foris


(And, by the way, there is no use of O_DIRECT in the db-4 code.  This is a pure
  RPM bug).

> 
> Steve
> 
> 
>