
To: Brian Foster <bfoster@xxxxxxxxxx>
Subject: Re: Data can't be wrote to XFS RIP [<ffffffffa041a99a>] xfs_dir2_sf_get_parent_ino+0xa/0x20
From: Kuo Hugo <tonytkdk@xxxxxxxxx>
Date: Wed, 22 Jul 2015 16:54:11 +0800
Cc: Hugo Kuo <hugo@xxxxxxxxxxxxxx>, Eric Sandeen <sandeen@xxxxxxxxxxx>, Darrell Bishop <darrell@xxxxxxxxxxxxxx>, xfs@xxxxxxxxxxx

Hi Brian,

>The original stacktrace shows the crash in a readdir request. I'm sure
>there are multiple things going on here (and there are a couple rename
>traces in the vmcore sitting on locks), of course, but where does the
>information about the rename come from?

I traced the source code of the application. Under certain conditions it moves data to a quarantined area (another folder on the same disk). The bug report describes a condition where a DELETE of an object (which creates an empty file in the directory) combined with listing that directory causes the data to be MOVEd (via os.rename) to the quarantined area (another folder). That os.rename call is the only place in the application that touches the quarantine folder.
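
To illustrate the sequence, here is a minimal sketch only; the paths, file names and helper functions are my own assumptions for illustration, not Swift's actual code:

import os
import uuid

# Hypothetical on-disk layout for illustration -- not the real Swift paths.
OBJ_DIR = "/srv/node/d1/objects/hashdir"
QUARANTINE_DIR = "/srv/node/d1/quarantined"

def delete_object(obj_dir):
    # A DELETE drops an empty tombstone-style file into the object directory.
    open(os.path.join(obj_dir, "tombstone.ts"), "w").close()

def list_and_quarantine(obj_dir, quarantine_dir):
    # Listing the directory can decide the existing data is bad and move it
    # aside; os.rename() to another folder on the same disk is the only call
    # in the application that touches the quarantine area.
    for name in os.listdir(obj_dir):
        if name.endswith(".data"):
            dest = os.path.join(quarantine_dir, uuid.uuid4().hex + "-" + name)
            os.rename(os.path.join(obj_dir, name), dest)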

>I'm not quite following here because I don't have enough context about
>what the application server is doing. So far, it sounds like we somehow
>have multiple threads competing to rename the same file..? Is there
>anything else in this directory at the time this sequence executes
>(e.g., a file with object data that also gets quarantined)?

The behavior above (a bug in the application) should not, by itself, trigger a kernel panic. Yes, there are multiple threads competing to DELETE (create an empty file) in the same directory while also moving the existing file to the quarantined area. I think this is the root cause of the kernel panic. In this scenario, 10 application workers raise 10 threads that do the same thing at the same moment.

>Ideally, we'd ultimately like to translate this into a sequence of
>operations as seen by the fs that hopefully trigger the problem. We
>might have to start by reproducing through the application server.
>Looking back at that bug report, it sounds like a 'DELETE' is a
>high-level server operation that can consist of multiple sub-operations
>at the filesystem level (e.g., list, conditional rename if *.ts file
>exists, etc.). Do you have enough information through any of the above
>to try and run something against Swift that might explicitly reproduce
>the problem? For example, have one thread that creates and recreates the
>same object repeatedly and many more competing threads that try to
>remove (or whatever results in the quarantine) it? Note that I'm just
>grasping at straws here, you might be able to design a more accurate
>reproducer based on what it looks like is happening within Swift.

We observe this issue on a production cluster, and it's hard to free up a machine with 100% identical hardware to test with right now.
I'll try to figure out an approach to reproduce it (a rough sketch of one idea is below) and will update this mail thread if I manage to.
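
In case it helps, I'm thinking of something along these lines: just a sketch against a test XFS mount, where all paths and file names are assumptions (not the real Swift code) and the 10-worker count follows the scenario above. One thread keeps recreating the same object file while ten other threads list the directory and race to rename whatever they find into a quarantine folder on the same filesystem:

import errno
import os
import threading
import time

# Assumed test layout on an XFS mount -- adjust as needed.
BASE = "/mnt/xfs-test"
OBJ_DIR = os.path.join(BASE, "objects", "hashdir")
QUARANTINE = os.path.join(BASE, "quarantined")

def creator(stop):
    # One thread creating and recreating the same object file.
    while not stop.is_set():
        with open(os.path.join(OBJ_DIR, "1437550451.00000.data"), "w") as f:
            f.write("x")

def quarantiner(stop, n):
    # Competing threads drop a tombstone, list the directory and rename
    # entries away, mimicking the DELETE + list + rename-to-quarantine path.
    dest_dir = os.path.join(QUARANTINE, "worker-%d" % n)
    os.makedirs(dest_dir)
    while not stop.is_set():
        open(os.path.join(OBJ_DIR, "tombstone.ts"), "w").close()
        try:
            for name in os.listdir(OBJ_DIR):
                os.rename(os.path.join(OBJ_DIR, name),
                          os.path.join(dest_dir, name))
        except OSError as e:
            if e.errno != errno.ENOENT:   # losing the race is expected
                raise

def main():
    os.makedirs(OBJ_DIR)
    os.makedirs(QUARANTINE)
    stop = threading.Event()
    threads = [threading.Thread(target=creator, args=(stop,))]
    threads += [threading.Thread(target=quarantiner, args=(stop, n))
                for n in range(10)]
    for t in threads:
        t.start()
    try:
        time.sleep(60)    # let the threads hammer the directory for a while
    finally:
        stop.set()
        for t in threads:
            t.join()

if __name__ == "__main__":
    main()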

Thanks // Hugo
