xfs
[Top] [All Lists]

Re: [PATCH 23/22] xfs: add metadata CRC documentation

To: xfs@xxxxxxxxxxx
Subject: Re: [PATCH 23/22] xfs: add metadata CRC documentation
From: Dave Howorth <dhoworth@xxxxxxxxxxxxxxxxx>
Date: Fri, 05 Apr 2013 12:20:56 +0100
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <23143019.xAQVWLRqiQ@xrated>
Organization: MRC Laboratory for Molecular Biology
References: <1364965892-19623-1-git-send-email-david@xxxxxxxxxxxxx> <20130405070006.GJ12011@dastard> <23143019.xAQVWLRqiQ@xrated>
User-agent: Thunderbird 1.5.0.10 (X11/20060911)
Most illuminating, thank you :) Allow me to add a few more nitpicks.

Hans-Peter Jansen wrote:
> Hi Dave,
> 
> On Freitag, 5. April 2013 18:00:06 Dave Chinner wrote:
>> xfs: add metadata CRC documentation
>>
>> From: Dave Chinner <dchinner@xxxxxxxxxx>
>>
>> Add some documentation about the self describing metadata and the
>> code templates used to implement it.
> 
> Nice text. This is the coolest addition to XFS since invention of sliced 
> bread.
> 
> One question arose from reading: since only the metadata is protected, any
> corruption of data blocks (file content) will still go unnoticed, does it?
> 
> Allow me to propose some minor corrections (from the nitpick department..).
> 
>> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
>> ---
>>  .../filesystems/xfs-self-describing-metadata.txt   |  352 
>> ++++++++++++++++++++
>>  1 file changed, 352 insertions(+)
>>
>> diff --git a/Documentation/filesystems/xfs-self-describing-metadata.txt 
>> b/Documentation/filesystems/xfs-self-describing-metadata.txt
>> new file mode 100644
>> index 0000000..da7edc9
>> --- /dev/null
>> +++ b/Documentation/filesystems/xfs-self-describing-metadata.txt
>> @@ -0,0 +1,352 @@
>> +XFS Self Describing Metadata
>> +----------------------------
>> +
>> +Introduction
>> +------------
>> +
>> +The largest scalability problem facing XFS is not one of algorithmic
>> +scalability, but of verification of the filesystem structure. Scalabilty of 
>> the
>> +structures and indexes on disk and the algorithms for iterating them are
>> +adequate for supporting PB scale filesystems with billions of inodes, 
>> however it
>> +is this very scalability that causes the verification problem.
>> +
>> +Almost all metadata on XFS is dynamically allocated. The only fixed location
>> +metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all
>> +other metadata structures need to be discovered by walking the filesystem
>> +structure in different ways. While this is already done by userspace tools 
>> for
>> +validating and repairing the structure, there are limits to what they can
>> +verify, and this in turn limits the supportable size of an XFS filesystem.
>> +
>> +For example, it is entirely possible to manually use xfs_db and a bit of
>> +scripting to analyse the structure of a 100TB filesystem when trying to
>> +determine the root cause of a corruption problem, but it is still mainly a
>> +manual task of verifying that things like single bit errors or misplaced 
>> writes
>> +weren't the ultimate cause of a corruption event. It may take a few hours 
>> to a
>> +few days to perform such forensic analysis, so for at this scale root cause
>> +analysis is entirely possible.
>> +
>> +However, if we scale the filesystem up to 1PB, we now have 10x as much 
>> metadata
>> +to analyse and so that analysis blows out towards weeks/months of forensic 
>> work.
>> +Most of the analysis work is slow and tedious, so as the amount of analysis 
>> goes
>> +up, the more likely that the cause will be lost in the noise.  Hence the 
>> primary
>> +concern for supporting PB scale filesystems is minimising the time and 
>> effort
>> +required for basic forensic analysis of the filesystem structure.
>> +
>> +
>> +Self Describing Metadata
>> +------------------------
>> +
>> +One of the problems with the current metadata format is that apart from the
>> +magic number in the metadata block, we have no other way of identifying 
>> what it
>> +is supposed to be. We can't even identify if it is the right place. Put 
>> simply,
>> +you can't look at a single metadata block in isolation and say "yes, it is
>> +supposed to be there and the contents are valid".
>> +
>> +Hence most of the time spent on forensic analysis is spent doing basic
>> +verification of metadata values, looking for values that are in range (and 
>> hence
>> +not detected by automated verification checks) but are not correct. Finding 
>> and
>> +understanding how things like cross linked block lists (e.g. sibling
>> +pointers in a btree end up with loops in them) are the key to understanding 
>> what
>> +went wrong, but it is impossible to tell what order the blocks were linked 
>> into
>> +each other or written to disk after the fact.
>> +
>> +Hence we need to record more information into the metadata to allow us to
>> +quickly determine if the metadata is intact and can be ignored for the 
>> purpose
>> +of analysis. We can't protect against every possible type of error, but we 
>> can
>> +ensure that common types of errors are easily detectable.  Hence the 
>> concept of
>> +self describing metadata.
>> +
>> +The first, fundamental requirement of self describing metadata is that the
>> +metadata object contains some form of unique identifier in a well known
>> +location. This allows us to identify the expected contents of the block and
>> +hence parse and verify the metadata object. IF we can't independently 
>> identify
>> +the type of metadata in the object, then the metadata doesn't describe 
>> itself
>> +very well at all!
>> +
>> +Luckily, almost all XFS metadata has magic numbers embedded already - only 
>> the
>> +AGFL, remote symlinks and remote attribute blocks do not contain identifying
>> +magic numbers. Hence we can change the on-disk format of all these objects 
>> to
>> +add more identifying information and detect this simply by changing the 
>> magic
>> +numbers in the metadata objects. That is, if it has the current magic 
>> number,
>> +the metadata isn't self identifying. If it contains a new magic number, it 
>> is
>> +self identifying and we can do much more expansive automated verification 
>> of the
>> +metadata object at runtime, during forensic analysis or repair.
>> +
>> +As a primary concern, self describing metadata needs to some form of overall
> 
>                                                         ^^ scratch that
> 
>> +integrity checking. We cannot trust the metadata if we cannot verify that 
>> it has
>> +not been changed as a result of external influences. Hence we need some 
>> form of
>> +integrity check, and this is done by adding CRC32c validation to the 
>> metadata
>> +block. If we can verify the block contains the metadata it was intended to
>> +contain, a large amount of the manual verification work can be skipped.
>> +
>> +CRC32c was selected as metadata cannot be more than 64k in length in XFS and
>> +hence a 32 bit CRC is more than sufficient to detect multi-bit errors in
>> +metadata blocks. CRC32c is also now hardware accelerated on common CPUs so 
>> it is
>> +fast. So while CRC32c is not the strongest of integrity checks that could be
> 
>                                                 ^ possible (perhaps)
> 
>> +used, it is more than sufficient for our needs and has relatively little
>> +overhead. Adding support for larger integrity fields and/or algorithms does
> 
>                                                                               
> n't
> 
>> +really provide any extra value over CRC32c, but it does add a lot of 
>> complexity
>> +and so there is no provision for changing the integrity checking mechanism.
>> +
>> +Self describing metadata needs to contain enough information so that the
>> +metadata block can be verified as being in the correct place without 
>> needing to
>> +look at any other metadata. This means it needs to contain location 
>> information.
>> +Just adding a block number to the metadata is not sufficient to protect 
>> against
>> +mis-directed writes - a write might be misdirected to the wrong LUN and so 
>> be
>> +written to the "correct block" of the wrong filesystem. Hence location
>> +information must contain a filesystem identifier as well as a block number.
>> +
>> +Another key information point in forensic analysis is knowing who the 
>> metadata
>> +block belongs to. We already know it's type, it's location, that it's valid
> 
>        shouldn't this spelled:       its        its 
> 
>> +and/or corrupted, and how long ago that it was last modified. Knowing the 
>> owner
>> +of the block is important as it allows us to find other related metadata to
>> +determine the scope of the corruption. For example, if we have a extent 
>> btree
>> +object, we don't know what inode it belongs to and hence have to walk the 
>> entire
>> +filesystem to find the owner of the block. Worse, the corruption could mean 
>> that
>> +no owner can be found (i.e. it's an orphan block), and so without an owner 
>> field
>> +in the metadata we have no idea of the scope of the corruption. If we have 
>> an
>> +owner field in the metadata object, we can immediately do top down 
>> validation to
>> +determine the scope of the problem.
>> +
>> +Different types of metadata have different owner identifiers. For example,
>> +directory, attribute and extent tree blocks are all owned by an inode, 
>> whilst
>> +freespace btree blocks are owned by an allocation group. Hence the size and
>> +contents of the owner field are determined by the type of metadata object 
>> we are
>> +looking at. For example, directories, extent maps and attributes are owned 
>> by
>> +inodes, while freespace btree blocks are owned by a specific allocation 
>> group.
>> +THe owner information can also identify misplaced writes (e.g. freespace 
>> btree
> 
>    The
> 
>> +block written to the wrong AG).
>> +
>> +Self describing metadata also needs to contain some indication of when it 
>> was
>> +written to the filesystem. One of the key information points when doing 
>> forensic
>> +analysis is how recently the block was modified. Correlation of set of 
>> corrupted
>> +metadata blocks based on modification times is important as it can indicate
>> +whether the corruptions are related, whether there's been multiple 
>> corruption
>> +events that lead to the eventual failure, and even whether there are 
>> corruptions
>> +present that the run-time verification is not detecting.
>> +
>> +For example, we can determine whether a metadata object is supposed to be 
>> free
>> +space or still allocated when it is still referenced by it's owner can be
> 
>                                                            its
                   allocated. When
> 
>> +determined by looking at when the free space btree block that contains the 
>> block
>> +was last written compared to when the metadata object itself was last 
>> written.
>> +If the free space block is more recent than the object and the objects 
>> owner,
                                                                object's
>> +then there is a very good chance that the block should have been removed 
>> from
>> +it's owner.
    its
>> +
>> +To provide this "written timestamp", each metadata block gets the Log 
>> Sequence
>> +Number (LSN) of the most recent transaction it was modified on written into 
>> it.
>> +This number will always increase over the life of the filesystem, and the 
>> only
>> +thing that resets it is running xfs_repair on the filesystem. Further, by 
>> use of
>> +the LSN we can tell if the corrupted metadata all belonged to the same log
>> +checkpoint and hence have some idea of how much modification occurred 
>> between
>> +the first and last instance of corrupt metadata on disk and, further, how 
>> much
>> +modification occurred between the corruption being written and  when it was
>> +detected.
>> +
>> +Runtime Validation
>> +------------------
>> +
>> +Validation of self-describing metadata takes place at runtime in two places:
>> +
>> +    - immediately after a successful read from disk
>> +    - immediately prior to write IO submission
>> +
>> +The verification is completely stateless - it is done independently of the
>> +modification process, and seeks only to check that the metadata is what it 
>> says
>> +it is and that the metadata fields are within bounds and internally 
>> consistent.
>> +As such, we cannot catch all types of corruption that can occur within a 
>> block
>> +as there may be certain limitations that operational state enforces of the
>> +metadata, or there may be corruption of interblock relationships (e.g. 
>> corrupted
>> +sibling pointer lists). Hence we still need stateful checking in the main 
>> code
>> +body, but in general most of the per-field validation is handled by the
>> +verifiers.
>> +
>> +For read verification, the caller needs to specify the expected type of 
>> metadata
>> +that it should see, and the IO completion process verifies that the metadata
>> +object matches what was expected. If the verification process fails, then it
>> +marks the object being read as EFSCORRUPTED. The caller needs to catch this
>> +error (same as for IO errors), and if it needs to take special action due 
>> to a
>> +verification error it can do so by catching the EFSCORRUPTED error value. 
>> If we
>> +need more discrimination of error type at higher levels, we can define new
>> +error numbers for different errors as necessary.
>> +
>> +The first step in read verification is checking the magic number and 
>> determining
>> +whether CRC validating is necessary. If it is, the CRC32c is caluclated and
> 
>                                                                    cu
> 
>> +compared against the value stored in the object itself. Once this is 
>> validated,
>> +further checks are made against the location information, followed by 
>> extensive
>> +object specific metadata validation. If any of these checks fail, then the
>> +buffer is considered corrupt and the EFSCORRUPTED error is set 
>> appropriately.
>> +
>> +Write verification is the opposite of the read verification - first the 
>> object
>> +is extensively verified and if it is OK we then update the LSN from the last
>> +modification made to the object, After this, we calculate the CRC and 
>> insert it
>> +into the object. Once this is done the write IO is allowed to continue. If 
>> any
>> +error occurs during this process, the buffer is again marked with a 
>> EFSCORRUPTED
>> +error for the higher layers to catch.
>> +
>> +Structures
>> +----------
>> +
>> +A typical on-disk structure needs to contain the following information:
>> +
>> +struct xfs_ondisk_hdr {
>> +        __be32  magic;              /* magic number */
>> +        __be32  crc;                /* CRC, not logged */
>> +        uuid_t  uuid;               /* filesystem identifier */
>> +        __be64  owner;              /* parent object */
>> +        __be64  blkno;              /* location on disk */
>> +        __be64  lsn;                /* last modification in log, not logged 
>> */
>> +};
>> +
>> +Depending on the metadata, this information may be part of a header stucture
> 
>                                                                        
> structure
> 
>> +separate to the metadata contents, or may be distributed through an existing
>> +structure. The latter occurs with metadata that already contains some of 
>> this
>> +information, such as the superblock and AG headers.
>> +
>> +Other metadata may have different formats for the information, but the same
>> +level of information is generally provided. For example:
>> +
>> +    - short btree blocks have a 32 bit owner (ag number) and a 32 bit block
>> +      number for location. The two of these combined provide the same
>> +      information as @owner and @blkno in eh above structure, but using 8
                                              the
>> +      bytes less space on disk.
>> +
>> +    - directory/attribute node blocks have a 16 bit magic number, and the
>> +      header that contains the magic number has other information in it as
>> +      well. hence the additional metadata headers change the overall format
>> +      of the metadata.
>> +
>> +A typical buffer read verifier is structured as follows:
>> +
>> +#define XFS_FOO_CRC_OFF             offsetof(struct xfs_ondisk_hdr, crc)
>> +
>> +static void
>> +xfs_foo_read_verify(
>> +    struct xfs_buf  *bp)
>> +{
>> +       struct xfs_mount *mp = bp->b_target->bt_mount;
>> +
>> +        if ((xfs_sb_version_hascrc(&mp->m_sb) &&
>> +             !xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length),
>> +                                    XFS_FOO_CRC_OFF)) ||
>> +            !xfs_foo_verify(bp)) {
>> +                XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, 
>> bp->b_addr);
>> +                xfs_buf_ioerror(bp, EFSCORRUPTED);
>> +        }
>> +}
>> +
>> +The code ensures that the CRC is only checked if the filesystem has CRCs 
>> enabled
>> +by checking the superblock of the feature bit, and then if the CRC verifies 
>> OK
>> +(or is not needed) it then verifies the actual contents of the block.
> 
>                          ^^^^ scratch then perhaps
> 
>> +
>> +The verifier function will take a couple of different forms, depending on
>> +whether the magic number can be used to determine the format of the block. 
>> In
>> +the case it can't, the code will is structured as follows:
                   scratch will ^^^^
>> +
>> +static bool
>> +xfs_foo_verify(
>> +    struct xfs_buf          *bp)
>> +{
>> +        struct xfs_mount    *mp = bp->b_target->bt_mount;
>> +        struct xfs_ondisk_hdr       *hdr = bp->b_addr;
>> +
>> +        if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
>> +                return false;
>> +
>> +        if (!xfs_sb_version_hascrc(&mp->m_sb)) {
>> +            if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
>> +                    return false;
>> +            if (bp->b_bn != be64_to_cpu(hdr->blkno))
>> +                    return false;
>> +            if (hdr->owner == 0)
>> +                    return false;
>> +    }
>> +
>> +    /* object specific verification checks here */
>> +
>> +        return true;
>> +}
>> +
>> +If there are different magic numbers for the different formats, the verifier
>> +will look like:
>> +
>> +static bool
>> +xfs_foo_verify(
>> +    struct xfs_buf          *bp)
>> +{
>> +        struct xfs_mount    *mp = bp->b_target->bt_mount;
>> +        struct xfs_ondisk_hdr       *hdr = bp->b_addr;
>> +
>> +        if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) {
>> +            if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
>> +                    return false;
>> +            if (bp->b_bn != be64_to_cpu(hdr->blkno))
>> +                    return false;
>> +            if (hdr->owner == 0)
>> +                    return false;
>> +    } else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
>> +            return false;
>> +
>> +    /* object specific verification checks here */
>> +
>> +        return true;
>> +}
>> +
>> +Write verifiers are very similar to the read verifiers, they just do things 
>> in
>> +the opposite order to the read verifiers. A typical write verifier:
>> +
>> +static void
>> +xfs_foo_write_verify(
>> +    struct xfs_buf  *bp)
>> +{
>> +    struct xfs_mount        *mp = bp->b_target->bt_mount;
>> +    struct xfs_buf_log_item *bip = bp->b_fspriv;
>> +
>> +    if (!xfs_foo_verify(bp)) {
>> +            XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, 
>> bp->b_addr);
>> +            xfs_buf_ioerror(bp, EFSCORRUPTED);
>> +            return;
>> +    }
>> +
>> +    if (!xfs_sb_version_hascrc(&mp->m_sb))
>> +            return;
>> +
>> +
>> +    if (bip) {
>> +            struct xfs_ondisk_hdr   *hdr = bp->b_addr;
>> +            hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn);
>> +    }
>> +    xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF);
>> +}
>> +
>> +This will verify the internal structure of the metadata before we go any
>> +further, detecting corruptions that have occurred as the metadata has been
>> +modified in memory. If the metadata verifies OK, and CRCs are enabled, we 
>> then
>> +update the LSN field (when it was last modified) and calculate the CRC on 
>> the
>> +metadata. Once this is done, we can issue the IO.
>> +
>> +Inodes and Dquots
>> +-----------------
>> +
>> +Inodes and dquots are special snowflakes. They have per-object CRC and
>> +self-identifiers, but they are packed so that there are multiple objects per
>> +buffer. Hence we do not use per-buffer verifiers to do the work of 
>> per-object
>> +verification and CRC calculations. The per-buffer verifiers simply perform 
>> basic
>> +identification of the buffer - that they contain inodes or dquots, and that
>> +there are magic numbers in all the expected spots. All further CRC and
>> +verification checks are done when each inode is read from or written back 
>> to the
>> +buffer.
>> +
>> +The structure of the verifiers and the identifiers checks is very similar 
>> to the
>> +buffer code described above. The only difference is where they are called. 
>> For
>> +example, inode read verification is done in xfs_iread() when the inode is 
>> first
>> +read out of the buffer and the struct xfs_inode is instantiated. The inode 
>> is
>> +already extensively verified during writeback in xfs_iflush_int, so the only
>> +addition here add the LSN and CRC to the inode as it is copied back into the
>                 ^   
>               is to
> 
>> +buffer.
>> +
>> +XXX: inode unlinked list modification doesn't recalculate the inode CRC! 
>> None of
>> +the unlinked list modifications check or update CRCs, neither during unlink 
>> nor
>> +log recovery. So, it's gone unnoticed until now. This won't matter 
>> immediately -
>> +repair will probably complain about it - but it needs to be fixed.
>> +
> 
> Cheers,
> Pete

Cheers, Dave

<Prev in Thread] Current Thread [Next in Thread>