> >
> > Hi,
> >
> > Using the following test.sh script :
> >
> > #!/bin/sh
> >
> > counter=1000
> > while [ $counter -le 2000 ]
> > do
> >
> > echo "File number $counter" >> file$counter
> > counter=$(( $counter + 1))
> > done
> >
> > I try to create 1000 (1001 to be precise) files in an xfs mounted directory.
> > This is what happens:
> >
> > beauty:/mnt/test# sh /tmp/test.sh
> > beauty:/mnt/test# ls | wc
> > 995 995 8955
> > beauty:/mnt/test# rm *
> > beauty:/mnt/test# ls
> > file1144 file1291 file1438 file1585 file1732 file1879
> > beauty:/mnt/test# ls
> >
>
> This looks very much like a bug we found in glibc, in the getdents syscall
> interface routine. It has to do with d_off values in the dirent structure
> using bit 2^31, and getdents64 not using lseek64. It originally showed
> up when running a 2.3 kernel as a client to an NFS server exporting
> an XFS filesystem. I was thinking we'd defaulted to dir2 format, and
> that should've kept us from seeing this problem; it looks like more
> digging is required.
>
Ted, this is caused by something in the dir2 handling of the d_off field
in the dirent structure. We are indeed hitting the scenario where the
glibc getdents code seeks backwards. The d_off field is supposed to
be the offset of the following directory entry. However, running strace
on an ls of a large directory shows that ls seeks to a specific offset,
but the next getdents call starts with the record after the one we
wanted, so we skip one entry. These offsets are not real offsets, of
course.
I modified the script to this:
#!/bin/sh
counter=0
while [ $counter -le 2000 ]
do
echo "File number $counter" >> long-named-file$counter
counter=$(( $counter + 1))
done
Here is a snapshot of strace output:
{d_ino=2743392, d_off=6638, d_reclen=32, d_name="long-named-file1644"}
{d_ino=2743393, d_off=6642, d_reclen=32, d_name="long-named-file1645"}
{d_ino=2743394, d_off=6646, d_reclen=32, d_name="long-named-file1646"}
{d_ino=2743395, d_off=6650, d_reclen=32, d_name="long-named-file1647"}
{d_ino=2743396, d_off=6654, d_reclen=32, d_name="long-named-file1648"}
{d_ino=2743397, d_off=6662, d_reclen=32, d_name="long-named-file1649"}
{d_ino=2743398, d_off=6666, d_reclen=32, d_name="long-named-file1650"}
{d_ino=2743399, d_off=6670, d_reclen=32, d_name="long-named-file1651"}
{d_ino=2743400, d_off=6674, d_reclen=32, d_name="long-named-file1652"}
.... and so on
{d_ino=2744476, d_off=6882, d_reclen=32, d_name="long-named-file1704"}
{d_ino=2744477, d_off=6882, d_reclen=32, d_name="long-named-file1705"}}, 54241) = 54220
lseek(4, 6646, SEEK_SET) = 6646
brk(0x806f000) = 0x806f000
brk(0x807b000) = 0x807b000
brk(0x8092000) = 0x8092000
mmap(NULL, 188416, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4010b000
mremap(0x4010b000, 188416, 372736, MREMAP_MAYMOVE) = 0x4010b000
getdents(4, {
{d_ino=2743396, d_off=6654, d_reclen=32, d_name="long-named-file1648"}
{d_ino=2743397, d_off=6662, d_reclen=32, d_name="long-named-file1649"}
The lseek to 6646 (the d_off of file 1646) should have positioned us at
the record for file 1647, but getdents resumed at file 1648, so file 1647
went missing.
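To make the expectation concrete, here is a minimal sketch (not part of
the original test; the helper name expected_after is made up for
illustration). It reads strace-style dirent lines on stdin and prints the
name of the entry that should come back first after seeking to a given
d_off, assuming d_off really is the position of the following entry:

```shell
#!/bin/sh
# expected_after OFFSET
# Hypothetical helper: given strace-style dirent lines on stdin, print the
# d_name of the entry that follows the one whose d_off equals OFFSET.
# Assumption: d_off holds the position of the *following* directory entry.
expected_after() {
    # Reduce each dirent line to "<d_off> <d_name>".
    sed -n 's/.*d_off=\([0-9]*\),.*d_name="\([^"]*\)".*/\1 \2/p' |
    awk -v target="$1" '
        found        { print $2; exit }   # first entry after the match
        $1 == target { found = 1 }        # entry whose d_off is the seek target
    '
}
```

Fed the first block of strace output above with an offset of 6646, this
prints long-named-file1647, whereas the getdents call after the lseek
actually started at long-named-file1648.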
I suspect some of my other changes fixed the original script - maybe
ls started using larger buffers? Or maybe my ls/libc is different from the
one which hit the original problem.
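For anyone who wants to check whether their own ls/libc combination is
affected, a rough sketch along these lines may help. The function name
missing_entries and its arguments are my own invention, not an existing
tool; it simply reports every expected file name that ls does not list:

```shell
#!/bin/sh
# missing_entries DIR PREFIX FIRST LAST
# Hypothetical helper: print every expected name PREFIX<n>
# (FIRST <= n <= LAST) that does not appear in the output of ls DIR.
# Run it right after the test script; any name printed was skipped by ls.
missing_entries() {
    dir=$1 prefix=$2 n=$3 last=$4
    listing=$(ls "$dir")
    while [ "$n" -le "$last" ]
    do
        name=$prefix$n
        # -x: match the whole line, -F: treat the name as a fixed string
        if ! printf '%s\n' "$listing" | grep -qxF -e "$name"
        then
            echo "$name"
        fi
        n=$(( n + 1 ))
    done
}
```

For the modified script this would be run as
"missing_entries /mnt/test long-named-file 0 2000"; on an unaffected
setup it should print nothing.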
We also have reports of ls going into an infinite loop over NFS, which
could be related to this.
Steve