On Saturday, 28 January 2012, Eric Sandeen wrote:
> On 1/28/12 8:55 AM, Martin Steigerwald wrote:
> > On Friday, 27 January 2012, Eric Sandeen wrote:
> >> On 1/27/12 1:50 AM, Manny wrote:
> >>> Hi there,
> >>>
> >>> I'm not sure if this is intended behavior, but I was a bit stumped
> >>> when I formatted a 30TB volume (12x3TB minus 2x3TB for parity in
> >>> RAID 6) with XFS and noticed that there were only 22 TB left. I
> >>> just called mkfs.xfs with default parameters - except for swidth
> >>> and sunit, which match the RAID setup.
> >>>
> >>> Is it normal that I lost 8TB just for the file system? That's
> >>> almost 30% of the volume. Should I set the block size higher? Or
> >>> should I increase the number of allocation groups? Would that make
> >>> a difference? What's the preferred method for handling such large
> >>> volumes?
> >>
> >> If it was 12x3TB I imagine you're confusing TB with TiB, so
> >> perhaps your 30T is really only 27TiB to start with.
> >>
> >> Anyway, fs metadata should not eat much space:
> >>
> >> # mkfs.xfs -dfile,name=fsfile,size=30t
> >> # ls -lh fsfile
> >> -rw-r--r-- 1 root root 30T Jan 27 12:18 fsfile
> >> # mount -o loop fsfile mnt/
> >> # df -h mnt
> >> Filesystem Size Used Avail Use% Mounted on
> >> /tmp/fsfile 30T 5.0M 30T 1% /tmp/mnt
> >>
> >> So Christoph's question was a good one; where are you getting
> >> your sizes?
>
> To solve your original problem, can you answer the above question?
> Adding your actual raid config output (/proc/mdstat maybe) would help
> too.
Eric, I wrote
> > An academic question:
to make clear that it was just something I was curious about.
I was not the original reporter anyway; neither he nor I has an actual
problem (see his answer), so all is good ;)
With your hint and some thinking and testing I was able to resolve
most of my other questions. Thanks.
For the gory details:
> > Why is it that I get
[…]
> > merkaba:/tmp> LANG=C df -hT /mnt/zeit
> > Filesystem Type Size Used Avail Use% Mounted on
> > /dev/loop0 xfs 30T 33M 30T 1% /mnt/zeit
> >
> >
> > 33MiB used on first mount instead of 5?
>
> Not sure offhand, differences in xfsprogs version mkfs defaults
> perhaps.
Okay, that's fine with me. I was just curious. It doesn't matter much.
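(If I ever want to pin that down: mkfs.xfs has a no-op mode that only
prints the parameters it would use, so one could compare the defaults
of two xfsprogs versions without writing anything. A sketch:

merkaba:/tmp> mkfs.xfs -V                               # show the xfsprogs version
merkaba:/tmp> mkfs.xfs -N -dfile,name=fsfile,size=30t   # print geometry only, create nothing

Diffing that output between versions should show which default, such
as log size, agcount, or lazy-count, accounts for the 5 MiB versus
33 MiB.)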
> > Hmmm, but creating the file on Ext4 does not work:
> ext4 is not designed to handle very large files, so anything
> above 16T will fail.
>
> > fallocate instead of sparse file?
>
> no, you just ran into file offset limits on ext4.
Oh, yes. I completely forgot about these ext4 limits. Sorry.
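For the record: with 4 KiB blocks, ext4 addresses file data with
32-bit logical block numbers, so a file cannot extend past
2^32 * 4 KiB = 16 TiB. A quick way to see the limit, assuming an ext4
filesystem mounted at /mnt/ext4test (a hypothetical path):

merkaba:/mnt/ext4test> truncate -s 17T fsfile   # past 16 TiB: should fail with "File too large" (EFBIG)
merkaba:/mnt/ext4test> truncate -s 15T fsfile   # within the limit: should succeed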
> > And on BTRFS as well as XFS it appears to try to create a 30T file
> > for real, i.e. by writing data - I stopped it before it could do too
> > much harm.
>
> Why do you say that it appears to create a 30T file for real? It
> should not...
I jumped to a conclusion too quickly. It did cause an I/O storm on the
Intel SSD 320:
martin@merkaba:~> vmstat -S M 1 (the -S M unit is not applied to bi/bo)
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 1630 4365 87 1087 0 0 101 53 7 81 5 2 93 0
1 0 1630 4365 87 1087 0 0 0 0 428 769 1 0 99 0
2 0 1630 4365 87 1087 0 0 0 0 426 740 1 1 99 0
0 0 1630 4358 87 1088 0 0 0 0 1165 2297 4 7 89 0
0 0 1630 4357 87 1088 0 0 0 40 1736 3434 8 6 86 0
0 0 1630 4357 87 1088 0 0 0 0 614 1121 3 1 96 0
0 0 1630 4357 87 1088 0 0 0 32 359 636 0 0 100 0
1 1 1630 3852 87 1585 0 0 13 81540 529 1045 1 7 91 1
0 3 1630 3398 87 2027 0 0 0 227940 1357 2764 0 9 54 37
4 3 1630 3225 87 2188 0 0 0 212004 2346 4796 5 6 41 49
1 3 1630 2992 87 2415 0 0 0 215608 1825 3821 1 6 42 50
0 2 1630 2820 87 2582 0 0 0 200492 1476 3089 3 6 49 41
1 1 1630 2569 87 2832 0 0 0 198156 1250 2508 0 6 59 34
0 2 1630 2386 87 3009 0 0 0 229896 1301 2611 1 6 56 37
0 2 1630 2266 87 3126 0 0 0 302876 1067 2093 0 5 62 33
1 3 1630 2266 87 3126 0 0 0 176092 723 1321 0 3 71 26
0 3 1630 2266 87 3126 0 0 0 163840 706 1351 0 1 74 25
0 1 1630 2266 87 3126 0 0 0 80104 3137 6228 1 4 69 26
0 0 1630 2267 87 3126 0 0 0 3 3505 7035 6 3 86 5
0 0 1630 2266 87 3126 0 0 0 0 631 1203 4 1 95 0
0 0 1630 2259 87 3127 0 0 0 0 715 1398 4 2 94 0
2 0 1630 2259 87 3127 0 0 0 0 1501 3087 10 3 86 0
0 0 1630 2259 87 3127 0 0 0 27 945 1883 5 2 93 0
0 0 1630 2259 87 3127 0 0 0 0 399 713 1 0 99 0
^C
But then it stopped. So it seems mkfs.xfs was just writing metadata,
and in the tmpfs test I obviously could not see that I/O.
Reviewing it, creating a 30 TB XFS filesystem should of course involve
writing some metadata at different places in the file.
I get:
merkaba:/mnt/zeit> LANG=C xfs_bmap fsfile
fsfile:
0: [0..255]: 96..351
1: [256..2147483639]: hole
2: [2147483640..2147483671]: 3400032..3400063
3: [2147483672..4294967279]: hole
4: [4294967280..4294967311]: 3400064..3400095
5: [4294967312..6442450919]: hole
6: [6442450920..6442450951]: 3400096..3400127
7: [6442450952..8589934559]: hole
8: [8589934560..8589934591]: 3400128..3400159
9: [8589934592..10737418199]: hole
10: [10737418200..10737418231]: 3400160..3400191
11: [10737418232..12884901839]: hole
12: [12884901840..12884901871]: 3400192..3400223
13: [12884901872..15032385479]: hole
14: [15032385480..15032385511]: 3400224..3400255
15: [15032385512..17179869119]: hole
16: [17179869120..17179869151]: 3400256..3400287
17: [17179869152..19327352759]: hole
18: [19327352760..19327352791]: 3400296..3400327
19: [19327352792..21474836399]: hole
20: [21474836400..21474836431]: 3400328..3400359
21: [21474836432..23622320039]: hole
22: [23622320040..23622320071]: 3400360..3400391
23: [23622320072..25769803679]: hole
24: [25769803680..25769803711]: 3400392..3400423
25: [25769803712..27917287319]: hole
26: [27917287320..27917287351]: 3400424..3400455
27: [27917287352..30064770959]: hole
28: [30064770960..30064770991]: 3400456..3400487
29: [30064770992..32212254599]: hole
30: [32212254600..32212254631]: 3400488..3400519
31: [32212254632..32215654311]: 352..3400031
32: [32215654312..32216428455]: 3400520..4174663
33: [32216428456..34359738239]: hole
34: [34359738240..34359738271]: 4174664..4174695
35: [34359738272..36507221879]: hole
36: [36507221880..36507221911]: 4174696..4174727
37: [36507221912..38654705519]: hole
38: [38654705520..38654705551]: 4174728..4174759
39: [38654705552..40802189159]: hole
40: [40802189160..40802189191]: 4174760..4174791
41: [40802189192..42949672799]: hole
42: [42949672800..42949672831]: 4174792..4174823
43: [42949672832..45097156439]: hole
44: [45097156440..45097156471]: 4174824..4174855
45: [45097156472..47244640079]: hole
46: [47244640080..47244640111]: 4174856..4174887
47: [47244640112..49392123719]: hole
48: [49392123720..49392123751]: 4174888..4174919
49: [49392123752..51539607359]: hole
50: [51539607360..51539607391]: 4174920..4174951
51: [51539607392..53687090999]: hole
52: [53687091000..53687091031]: 4174952..4174983
53: [53687091032..55834574639]: hole
54: [55834574640..55834574671]: 4174984..4175015
55: [55834574672..57982058279]: hole
56: [57982058280..57982058311]: 4175016..4175047
57: [57982058312..60129541919]: hole
58: [60129541920..60129541951]: 4175048..4175079
59: [60129541952..62277025559]: hole
60: [62277025560..62277025591]: 4175080..4175111
61: [62277025592..64424509191]: hole
62: [64424509192..64424509199]: 4175112..4175119
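The pattern makes sense to me: the small 32-sector extents sit
2147483640 units apart, and xfs_bmap counts in 512-byte blocks, so
that spacing is just one 4 KiB block short of 1 TiB:

merkaba:/mnt/zeit> echo $((2147483640 * 512))   # spacing of the small extents, in bytes
1099511623680

Assuming those small extents are the allocation group headers, mkfs
capped the AG size just below 1 TiB and laid out 30 AGs across the
30 TiB file.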
Okay, it needed to write 2 GB:
merkaba:/mnt/zeit> du -h fsfile
2,0G fsfile
merkaba:/mnt/zeit> du --apparent-size -h fsfile
30T fsfile
merkaba:/mnt/zeit>
I didn't expect mkfs.xfs to write 2 GB, but thinking it through for a
30 TB filesystem, I find this reasonable.
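In fact most of it seems to be the internal log, which mkfs zeroes
out: the two large extents near the middle of the file (extents 31 and
32 above, roughly at the 15 TiB mark, where XFS places the internal
log) add up to almost exactly 2 GiB, which matches what I understand
to be the maximum size of the XFS internal log:

merkaba:/mnt/zeit> echo $(( (3399680 + 774144) * 512 ))   # lengths of extents 31 and 32, in bytes
2136997888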
Still, it uses 33 MiB for metadata:
merkaba:/mnt/zeit> mkdir bigfilefs
merkaba:/mnt/zeit> mount -o loop fsfile bigfilefs
merkaba:/mnt/zeit> LANG=C df -hT bigfilefs
Filesystem Type Size Used Avail Use% Mounted on
/dev/loop0 xfs 30T 33M 30T 1% /mnt/zeit/bigfilefs
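To see where those 33 MiB go, one could inspect the geometry of the
mounted filesystem; xfs_info prints agcount, agsize, and the log size
(the exact output format depends on the xfsprogs version):

merkaba:/mnt/zeit> xfs_info bigfilefs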
Ciao,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7