| To: | Brian Foster <bfoster@xxxxxxxxxx> |
|---|---|
| Subject: | Re: Data can't be wrote to XFS RIP [<ffffffffa041a99a>] xfs_dir2_sf_get_parent_ino+0xa/0x20 |
| From: | Kuo Hugo <tonytkdk@xxxxxxxxx> |
| Date: | Wed, 22 Jul 2015 16:54:11 +0800 |
| Cc: | Hugo Kuo <hugo@xxxxxxxxxxxxxx>, Eric Sandeen <sandeen@xxxxxxxxxxx>, Darrell Bishop <darrell@xxxxxxxxxxxxxx>, xfs@xxxxxxxxxxx |
| Delivered-to: | xfs@xxxxxxxxxxx |
| In-reply-to: | <20150720151256.GA17816@xxxxxxxxxxxxxxx> |
| References: | <CA++_uhu=VNKtjax_JjsCZwDFT0Vk-CAjS5j=ba5+A5HL4nxpmA@xxxxxxxxxxxxxx> <20150709183255.GG63282@xxxxxxxxxxxxxxx> <CA++_uht5N6MtqUQfbB9A3R__UvR4aLN2q5-mFKiO__vU-Cxwpw@xxxxxxxxxxxxxx> <20150713125214.GA50787@xxxxxxxxxxxxxxx> <CA++_uhvrDBuP9nANTc0ZxZudDriYKrrtnaQUZzXPRLs0otD22w@xxxxxxxxxxxxxx> <20150713170158.GB50787@xxxxxxxxxxxxxxx> <CA++_uhvDrO2BmQ+q0bN=M_L-vUUaLZO9bHoKh0ntFveM5t-DNQ@xxxxxxxxxxxxxx> <CA++_uhuJNkO4MDyS_+veFpysGyqzhqLspB3g73DtUCQqK1F80Q@xxxxxxxxxxxxxx> <20150720114648.GB53450@xxxxxxxxxxxxxxx> <CA++_uhvwR1KucdHWnPzS5ysFuYyssFnUB95kS-piC_pRnq=dXw@xxxxxxxxxxxxxx> <20150720151256.GA17816@xxxxxxxxxxxxxxx> |
|
Hi Brian,

>The original stacktrace shows the crash in a readdir request. I'm sure
>there are multiple things going on here (and there are a couple rename
>traces in the vmcore sitting on locks), of course, but where does the
>information about the rename come from?

I traced through the application's source code. Under some conditions it moves data to a quarantine area (another folder on the same disk). The bug report describes one such condition: a DELETE of an object (which creates an empty file in a directory) combined with a listing of that directory causes the data to be moved (os.rename) to the quarantine area (another folder). The os.rename call is the only place in the application that touches the quarantine folder.

>I'm not quite following here because I don't have enough context about
>what the application server is doing. So far, it sounds like we somehow
>have multiple threads competing to rename the same file..? Is there
>anything else in this directory at the time this sequence executes
>(e.g., a file with object data that also gets quarantined)?

The behavior described above is a bug in the application, but it should not trigger a kernel panic. Yes, multiple threads are competing to DELETE (create an empty file) in the same directory and also to move the existing file to the quarantine area. I think this is the root cause of the kernel panic. The scenario is 10 application workers raising 10 threads that do the same thing at the same moment.

>Ideally, we'd ultimately like to translate this into a sequence of
>operations as seen by the fs that hopefully trigger the problem. We
>might have to start by reproducing through the application server.
>Looking back at that bug report, it sounds like a 'DELETE' is a
>high-level server operation that can consist of multiple sub-operations
>at the filesystem level (e.g., list, conditional rename if *.ts file
>exists, etc.). Do you have enough information through any of the above
>to try and run something against Swift that might explicitly reproduce
>the problem? For example, have one thread that creates and recreates the
>same object repeatedly and many more competing threads that try to
>remove (or whatever results in the quarantine) it? Note that I'm just
>grasping at straws here, you might be able to design a more accurate
>reproducer based on what it looks like is happening within Swift.

We observe this issue on a production cluster. It is currently hard to get spare equipment with exactly the same hardware to test on. I'll try to figure out a way to reproduce it, and I'll update this thread if I manage to.

Thanks // Hugo |
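A minimal sketch of the kind of filesystem-level reproducer discussed above, assuming the sequence is "create an empty marker file, list the directory, rename the other entries into a quarantine folder", run from several threads at once. This is not Swift's actual code: the function names, the directory names `objects` and `quarantined`, and the `.ts` marker naming are hypothetical stand-ins. It is not known to trigger the panic, and to be meaningful `root` would need to point at a directory on the affected XFS mount.

```python
import os
import tempfile
import threading


def delete_and_quarantine(obj_dir, quarantine_dir, worker_id, rounds):
    """Hypothetical stand-in for the DELETE path described in the thread."""
    moved = 0
    for i in range(rounds):
        marker = os.path.join(obj_dir, "%d-%d.ts" % (worker_id, i))
        # "DELETE" is modelled as creating an empty file in the directory.
        open(marker, "w").close()
        # Listing the directory corresponds to the readdir request.
        for name in os.listdir(obj_dir):
            src = os.path.join(obj_dir, name)
            if src == marker:
                continue
            try:
                # Move every other entry to the quarantine folder.
                os.rename(src, os.path.join(quarantine_dir, name))
                moved += 1
            except FileNotFoundError:
                # Another thread renamed this entry first.
                pass
    return moved


def run_reproducer(root, workers=10, rounds=50):
    """Run the competing threads; return (left, quarantined, moved) counts."""
    obj_dir = os.path.join(root, "objects")
    quarantine_dir = os.path.join(root, "quarantined")
    os.makedirs(obj_dir)
    os.makedirs(quarantine_dir)
    moved = [0] * workers

    def target(worker_id):
        moved[worker_id] = delete_and_quarantine(
            obj_dir, quarantine_dir, worker_id, rounds)

    threads = [threading.Thread(target=target, args=(w,))
               for w in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(os.listdir(obj_dir)), len(os.listdir(quarantine_dir)), sum(moved)


if __name__ == "__main__":
    # The default temporary directory is an assumption; use a path on the
    # XFS filesystem under test instead.
    with tempfile.TemporaryDirectory() as root:
        print(run_reproducer(root))
```

Every marker file has a unique name, so each one ends up either in the object directory or in the quarantine folder, which gives a simple consistency check after the threads finish.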