On Wed, Apr 01, 2015 at 05:09:11PM +0300, Danny Shavit wrote:
> Hello Dave,
> My name is Danny Shavit and I am with Zadara storage.
> We would appreciate your feedback regarding an XFS corruption and xfs_repair
> issue.
>
> We found a corrupted XFS volume in one of our systems. It is around 1 TB
> in size and holds about 12 M files.
> We ran xfs_repair on the volume, which succeeded after 42 minutes.
> We noticed that memory consumption rose to about 7.5 GB.
> Since some customers are using only 4 GB (and sometimes even 2 GB), we tried
> running "xfs_repair -m 3200" on a 4 GB RAM machine.
> However, this time an OOM event occurred while handling AG 26 during
> phase 3.
> The log of xfs_repair is enclosed below.
> We would appreciate your feedback on the amount of memory needed for
> xfs_repair in general and when using the "-m" option specifically.
> The xfs metadata dump (prior to xfs_repair) can be found here:
> https://zadarastorage-public.s3.amazonaws.com/xfs/xfsdump-prod-ebs_2015-03-30_23-00-38.tgz
> It is a 1.2 GB file (5.7 GB uncompressed).
>
> We would appreciate your feedback on the corruption pattern as well.

Have you tried something smaller, perhaps -m 2048? I just ran repair on
the metadump on a 4GB VM. It OOM'd with default options and completed in
a few minutes with -m 2048, though RSS still peaked at around 3.6GB.
Using -P (disable prefetch) seems to help at the cost of time: that run
took me ~20 minutes, but RSS peaked at around 2.4GB.
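
If you want to compare peak RSS across option combinations yourself, something
like the sketch below may help. It is only an illustration, not how the numbers
above were collected: the helper names (build_repair_cmd, run_and_report_peak_rss)
and the image path are made up, and it assumes the metadump has already been
restored to a regular file (e.g. with xfs_mdrestore), hence the -f flag.

```python
import resource
import subprocess
import sys


def build_repair_cmd(image, max_mem_mb=None, prefetch=True):
    """Build an xfs_repair command line for a filesystem image file.

    Hypothetical helper: -f treats the target as a regular file,
    -m caps approximate memory use (in MB), -P disables prefetching.
    """
    cmd = ["xfs_repair", "-f"]
    if max_mem_mb is not None:
        cmd += ["-m", str(max_mem_mb)]
    if not prefetch:
        cmd.append("-P")
    cmd.append(image)
    return cmd


def run_and_report_peak_rss(cmd):
    """Run cmd to completion and return (exit_code, peak_rss_kib).

    RUSAGE_CHILDREN reports the largest peak RSS among all children
    this process has waited for so far, so use a fresh interpreter
    per measurement. Linux reports KiB; macOS reports bytes.
    """
    proc = subprocess.run(cmd, check=False)
    maxrss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    if sys.platform == "darwin":
        maxrss //= 1024
    return proc.returncode, maxrss


# Example use (assumed image path, run as root on a scratch copy):
#   code, kib = run_and_report_peak_rss(
#       build_repair_cmd("/path/to/restored.img", max_mem_mb=2048, prefetch=False))
#   print(f"exit={code} peak_rss={kib / (1024 * 1024):.2f} GiB")
```

Note that -m is only an approximate limit, which is consistent with RSS going
well past 2GB in the -m 2048 run above, so it is worth measuring rather than
trusting the number passed in.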

FWIW, I'm also on a recent xfsprogs:

# xfs_repair -V
xfs_repair version 3.2.2

Brian

> --
> Thank you,
> Danny Shavit
> Zadarastorage
>
> ---------- xfs_repair log ----------------
> root@vsa-00000428-vc-1:/export/4xfsdump# date; xfs_repair -v /dev/dm-55; date
> Tue Mar 31 02:28:04 PDT 2015
> Phase 1 - find and verify superblock...
> - block cache size set to 735288 entries
> Phase 2 - using internal log
> - zero log...
> zero_log: head block 1920 tail block 1920
> - scan filesystem freespace and inode maps...
> agi_freecount 54, counted 55 in ag 7
> sb_ifree 947, counted 948
> - found root inode chunk
> Phase 3 - for each AG...
> - scan and clear agi unlinked lists...
> - process known inodes and perform inode discovery...
> - agno = 0
> - agno = 1
> - agno = 2
> - agno = 3
> - agno = 4
> - agno = 5
> - agno = 6
> - agno = 7
> - agno = 8
> - agno = 9
> - agno = 10
> - agno = 11
> - agno = 12
> - agno = 13
> - agno = 14
> - agno = 15
> - agno = 16
> - agno = 17
> - agno = 18
> - agno = 19
> - agno = 20
> - agno = 21
> bad . entry in directory inode 5691013154, was 5691013170: correcting
> bad . entry in directory inode 5691013156, was 5691013172: correcting
> bad . entry in directory inode 5691013157, was 5691013173: correcting
> bad . entry in directory inode 5691013163, was 5691013179: correcting
> - agno = 22
> - agno = 23
> - agno = 24
> - agno = 25
> - agno = 26 (Danny: OOM occurred here with -m 3200)
> - agno = 27
> - agno = 28
> - agno = 29
> - agno = 30
> - agno = 31
> - agno = 32
> - process newly discovered inodes...
> Phase 4 - check for duplicate blocks...
> - setting up duplicate extent list...
> - check for inodes claiming duplicate blocks...
> - agno = 0
> - agno = 1
> - agno = 2
> - agno = 3
> - agno = 4
> - agno = 5
> - agno = 6
> - agno = 7
> - agno = 8
> - agno = 9
> - agno = 10
> - agno = 11
> - agno = 12
> - agno = 13
> - agno = 14
> - agno = 15
> - agno = 16
> - agno = 17
> - agno = 18
> - agno = 19
> - agno = 20
> - agno = 21
> - agno = 22
> - agno = 23
> - agno = 24
> - agno = 25
> - agno = 26
> - agno = 27
> - agno = 28
> - agno = 29
> - agno = 30
> - agno = 31
> - agno = 32
> Phase 5 - rebuild AG headers and trees...
> - agno = 0
> - agno = 1
> - agno = 2
> - agno = 3
> - agno = 4
> - agno = 5
> - agno = 6
> - agno = 7
> - agno = 8
> - agno = 9
> - agno = 10
> - agno = 11
> - agno = 12
> - agno = 13
> - agno = 14
> - agno = 15
> - agno = 16
> - agno = 17
> - agno = 18
> - agno = 19
> - agno = 20
> - agno = 21
> - agno = 22
> - agno = 23
> - agno = 24
> - agno = 25
> - agno = 26
> - agno = 27
> - agno = 28
> - agno = 29
> - agno = 30
> - agno = 31
> - agno = 32
> - reset superblock...
> Phase 6 - check inode connectivity...
> - resetting contents of realtime bitmap and summary inodes
> - traversing filesystem ...
> - agno = 0
> - agno = 1
> - agno = 2
> - agno = 3
> - agno = 4
> - agno = 5
> - agno = 6
> - agno = 7
> - agno = 8
> - agno = 9
> - agno = 10
> - agno = 11
> entry "SavedXML" in dir inode 2992927241 inconsistent with .. value
> (4324257659) in ino 5691013156
> will clear entry "SavedXML"
> rebuilding directory inode 2992927241
> - agno = 12
> - agno = 13
> - agno = 14
> - agno = 15
> - agno = 16
> entry "Out" in dir inode 4324257659 inconsistent with .. value (2992927241)
> in ino 5691013172
> will clear entry "Out"
> rebuilding directory inode 4324257659
> - agno = 17
> - agno = 18
> - agno = 19
> - agno = 20
> - agno = 21
> entry "tocs_file" in dir inode 5691012138 inconsistent with .. value
> (3520464676) in ino 5691013154
> will clear entry "tocs_file"
> entry "trees.log" in dir inode 5691012138 inconsistent with .. value
> (3791956240) in ino 5691013155
> will clear entry "trees.log"
> rebuilding directory inode 5691012138
> entry "filelist.xml" in directory inode 5691012139 not consistent with ..
> value (1909707067) in inode 5691013157,
> junking entry
> fixing i8count in inode 5691012139
> entry "image001.jpg" in directory inode 5691012140 not consistent with ..
> value (2450176033) in inode 5691013163,
> junking entry
> fixing i8count in inode 5691012140
> entry "OCR" in dir inode 5691013154 inconsistent with .. value (5691013170)
> in ino 1909707065
> will clear entry "OCR"
> entry "Tmp" in dir inode 5691013154 inconsistent with .. value (5691013170)
> in ino 2179087403
> will clear entry "Tmp"
> entry "images" in dir inode 5691013154 inconsistent with .. value
> (5691013170) in ino 2450176007
> will clear entry "images"
> rebuilding directory inode 5691013154
> entry "286_Kellman_Hoffer_Master.pdf_files" in dir inode 5691013156
> inconsistent with .. value (5691013172) in ino 834535727
> will clear entry "286_Kellman_Hoffer_Master.pdf_files"
> rebuilding directory inode 5691013156
> - agno = 22
> - agno = 23
> - agno = 24
> - agno = 25
> - agno = 26
> - agno = 27
> - agno = 28
> - agno = 29
> - agno = 30
> - agno = 31
> - agno = 32
> - traversal finished ...
> - moving disconnected inodes to lost+found ...
> disconnected dir inode 834535727, moving to lost+found
> disconnected dir inode 1909707065, moving to lost+found
> disconnected dir inode 2179087403, moving to lost+found
> disconnected dir inode 2450176007, moving to lost+found
> disconnected dir inode 5691013154, moving to lost+found
> disconnected dir inode 5691013155, moving to lost+found
> disconnected dir inode 5691013156, moving to lost+found
> disconnected dir inode 5691013157, moving to lost+found
> disconnected dir inode 5691013163, moving to lost+found
> disconnected dir inode 5691013172, moving to lost+found
> Phase 7 - verify and correct link counts...
> resetting inode 81777983 nlinks from 2 to 12
> resetting inode 1909210410 nlinks from 1 to 2
> resetting inode 1909707067 nlinks from 3 to 2
> resetting inode 2450176033 nlinks from 18 to 17
> resetting inode 2992927241 nlinks from 13 to 12
> resetting inode 3520464676 nlinks from 13 to 12
> resetting inode 3791956240 nlinks from 13 to 12
> resetting inode 4324257659 nlinks from 13 to 12
> resetting inode 5691013154 nlinks from 5 to 2
> resetting inode 5691013156 nlinks from 3 to 2
>
> XFS_REPAIR Summary Tue Mar 31 03:11:00 2015
>
> Phase Start End Duration
> Phase 1: 03/31 02:28:04 03/31 02:28:05 1 second
> Phase 2: 03/31 02:28:05 03/31 02:28:42 37 seconds
> Phase 3: 03/31 02:28:42 03/31 02:48:29 19 minutes, 47 seconds
> Phase 4: 03/31 02:48:29 03/31 02:55:40 7 minutes, 11 seconds
> Phase 5: 03/31 02:55:40 03/31 02:55:43 3 seconds
> Phase 6: 03/31 02:55:43 03/31 03:10:57 15 minutes, 14 seconds
> Phase 7: 03/31 03:10:57 03/31 03:10:57
>
> Total run time: 42 minutes, 53 seconds
> done
> Tue Mar 31 03:11:01 PDT 2015
> _______________________________________________
> xfs mailing list
> xfs@xxxxxxxxxxx
> http://oss.sgi.com/mailman/listinfo/xfs