On Thu, 2004-05-06 at 13:05 -0600, Craig Tierney wrote:
> > I've mostly ruled out the problem being per-page, since as the test in
> > 198 shows, the corruption is the size of a given write. In the case of
> > doublewrite, 16k.
> > What appears to be happening is that, as the file is grown out of order
> > relative to the writes, every once in a while a 16k write is simply
> > forgotten about.
> > In other words, write 16k at 16k, then write 16k at 0k.
> > So grow to 32k starting at 16k, then go back and fill in
> > the 0-16k block, and sometimes the 0-16k block is simply not written
> > to disk.
> > I'm going to keep looking at this i_sem-not-locked theory a bit longer
> > before I give up on yet another theory.
> > If you have any observations or theories, let me know; it might be helpful.
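For anyone who wants to poke at the out-of-order case from user space, the write pattern quoted above (16k at 16k, then 16k at 0k) can be sketched roughly as follows. This is only an illustration, not the actual doublewrite test; the helper names and the per-block fill byte are made up for the example.

```python
import os
import tempfile

BLOCK = 16 * 1024  # 16k write size mentioned in the thread


def block_pattern(index, size=BLOCK):
    """Return a block filled with a non-zero byte derived from its index."""
    return bytes([(index % 255) + 1]) * size


def write_out_of_order(path, order=(1, 0), size=BLOCK):
    """Write blocks in the given order, e.g. 16k at 16k, then 16k at 0k."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o644)
    try:
        for index in order:
            # pwrite places each block at its own offset, so the first write
            # grows the file past a region that has not been written yet.
            os.pwrite(fd, block_pattern(index, size), index * size)
        os.fsync(fd)
    finally:
        os.close(fd)


def find_bad_blocks(path, count, size=BLOCK):
    """Return the indices of blocks whose contents do not match the pattern."""
    with open(path, "rb") as f:
        data = f.read()
    bad = []
    for index in range(count):
        if data[index * size:(index + 1) * size] != block_pattern(index, size):
            bad.append(index)
    return bad
```

Note that reading the file straight back will usually be served from the page cache, so to catch a block that never reached disk the check would have to run after an unmount/remount (or on another client).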
> It appears my corruption pattern is different (but not unrelated).
> I finally confirmed that when a file is corrupted, it is corrupted
> with data from another write. For example, temperature gets
> overwritten with height, or velocity is overwritten with humidity.
The problem is trying to figure out if the data is stale or really from
a different process.
If the data block was previously used and then freed, either by removing
the old file or truncating it, and the new file then reallocates that
free block but never writes to it, the data is stale.
Now if you pattern up a partition with zeros or some known pattern, make
a file system, and run your test with no removes or truncates, and you still
see data from another client, that is not a stale data problem.
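A rough sketch of how the check could look, assuming the partition was filled with zeros beforehand and each writer fills its blocks with its own non-zero tag byte. The tag scheme and function names are hypothetical, just to show the distinction:

```python
BACKGROUND = 0x00  # assumed fill byte used to pattern the partition before mkfs


def make_block(owner, size):
    """Build a block tagged with the writer's id (owner must be 1..255)."""
    return bytes([owner]) * size


def classify_block(block, owner):
    """Classify a block read back from a file written by `owner`."""
    values = set(block)
    if values == {owner}:
        return "ok"
    if values == {BACKGROUND}:
        # Still shows the partition pattern: the block was never written.
        return "never written"
    if len(values) == 1:
        # Uniformly tagged by some other writer: data from another client.
        return "foreign"
    return "mixed"
```

With no removes or truncates in the run, a "foreign" block cannot be leftover data from a freed block, so it points away from stale data.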
That would indicate a page management problem (usually a locking
problem); the size of the corruption is usually limited to a page size
in those cases.
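To tell the two cases apart from a dump, one could measure each corrupted extent and see whether it lines up with pages. A minimal sketch, assuming a 4k page size and that you have both the expected and the actual file contents in memory (names are illustrative only):

```python
PAGE_SIZE = 4096  # assumption; check the real page size on the test machine


def corrupt_runs(expected, actual):
    """Return (offset, length) for each contiguous run of mismatched bytes."""
    runs = []
    start = None
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        # The mismatch extends to the end of the compared region.
        runs.append((start, min(len(expected), len(actual)) - start))
    return runs


def looks_page_sized(run, page=PAGE_SIZE):
    """True if the run starts on a page boundary and spans whole pages."""
    start, length = run
    return length > 0 and start % page == 0 and length % page == 0
```

Runs that are whole, aligned pages fit the page management picture; a run that matches the write size instead (16k for doublewrite) fits the forgotten-write picture. If the bad data happens to share bytes with the good data, the runs will come out shorter than the real extent, so a pattern that differs at every byte works best.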
If I remember right, you were seeing problems of 1 or more pages in
I wish I had something for you to try; my latest go-around on this bug
is not looking promising.
I thought it might have something to do with i_sem not being held
for writes, but I'm still seeing the problem even with i_sem held.
I need to put an assert in generic_file_write just to make sure
i_sem is coming in locked.
> Unless there is something really wrong in caching, the corruption
> must be coming from another client process. In one case, process
> 3 showed corruption in file 8. The corruption in file 8 matches
> data from file 14 (files are written in order). Processes
> are all started at the same time, but they do get out of sync
> because of IO and other load issues.
> It seems like buffers are overwritten, or if the IO is async
> that a buffer is being used before it is actually no longer in
Russell Cattelan <cattelan@xxxxxxx>