I’ve been doing some testing that requires rather large file systems. I have an EMC XtremIO Dual X-Brick array from which I provision a 10 terabyte volume. Volumes in XtremIO are always thinly provisioned. The testing I’m doing required me to scrutinize default Linux mkfs(8) behavior for both Ext4 and XFS. This is part 1 in a short series and it is about Ext4.
Discard the Discard Option
The first thing I noticed in this testing was the fantastical “throughput” demonstrated at the array while running the mkfs(8) command with the “-t ext4” option/arg pair. As the following screen shot shows, the “throughput” at the array level was just shy of 72GB/s.
That’s not real I/O…I’ll explain…
The default options for Ext4 include the discard (TRIM under the covers) option. The mke2fs(8) manpage has this to say about the discard option:
Attempt to discard blocks at mkfs time (discarding blocks initially is useful on solid state devices and sparse / thin-provisioned storage). When the device advertises that discard also zeroes data (any subsequent read after the discard and before write returns zero), then mark all not-yet-zeroed inode tables as zeroed. This significantly speeds up filesystem initialization. This is set as default.
I’ve read that quoted text at least eleventeen times but the wording still sounds like gibberish-scented gobbledygook to me–well, except for the bit about significantly speeding up filesystem initialization.
Since XtremIO volumes are created thin, I don’t see any reason for mkfs to take action to make them, what, thinner? Please let me share test results challenging the assertion that the discard mkfs option results in faster file system initialization. This is the default functionality, after all.
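Before trusting mkfs to do anything useful with discard, it is worth confirming that the device advertises discard support at all. The following is a minimal sketch: the helper name and the fake device names are mine, the sysfs paths are the standard Linux locations, and the second argument exists only so the helper can be exercised against a fake tree.

```shell
# supports_discard: succeed if a whole-disk device advertises a nonzero
# discard granularity in sysfs; fail if the attribute is zero or absent.
supports_discard() {
  dev=$1
  sys=${2:-/sys}
  gran=$(cat "$sys/block/$dev/queue/discard_granularity" 2>/dev/null) || return 1
  [ "${gran:-0}" -gt 0 ]
}

# On a live system one can also simply run:
#   lsblk --discard /dev/xtremio/fs/test
# and look for nonzero DISC-GRAN / DISC-MAX columns.
```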
In the following terminal output you’ll see that the default mkfs options take 152 seconds to make a file system on a freshly-created 10TB XtremIO volume:
# time mkfs -t ext4 /dev/xtremio/fs/test
mke2fs 1.43-WIP (20-Jun-2013)
Discarding device blocks: done
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=2 blocks, Stripe width=16 blocks
335544320 inodes, 2684354560 blocks
134217728 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
81920 block groups
32768 blocks per group, 32768 fragments per group
4096 inodes per group
Superblock backups stored on blocks:
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632,
	2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616,
	78675968, 102400000, 214990848, 512000000, 550731776, 644972544,
	1934917632, 2560000000

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

real	2m32.055s
user	0m3.648s
sys	0m17.280s
#
The mkfs(8) Command Without Default Discard Functionality
Please bear in mind that the default 152-second result is not due to languishing on pathetic physical I/O. The storage is fast. Please consider the following terminal output, where I passed in the non-default -E option with the nodiscard argument. The file system creation took 4.8 seconds:
# time mkfs -t ext4 -E nodiscard /dev/xtremio/fs/test
mke2fs 1.43-WIP (20-Jun-2013)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=2 blocks, Stripe width=16 blocks
335544320 inodes, 2684354560 blocks
134217728 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
81920 block groups
32768 blocks per group, 32768 fragments per group
4096 inodes per group
Superblock backups stored on blocks:
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632,
	2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616,
	78675968, 102400000, 214990848, 512000000, 550731776, 644972544,
	1934917632, 2560000000

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

real	0m4.856s
user	0m4.264s
sys	0m0.415s
#
I think going from 152 seconds down to 4.8 makes the point that, with properly thin-provisioned storage, the mkfs discard option does not “significantly speed up filesystem initialization.” But initializing file systems is not something one does frequently, so an investigation into the discard mount(8) option was in order.
Taking Ext4 For A Drive
Since I had this 10TB Ext4 file system–and a fresh focus on file system discard (storage TRIM) features–I thought I’d take it for a drive.
Discarded the Default Discard But Added The Non-Default Discard
While the default mkfs(8) command includes discard, the mount(8) command does not. I decided to investigate this option while unlinking a reasonable number of large files. To do so I ran a simple script (shown below) that copies 64 files of 16 gigabytes each, in parallel, into the Ext4 file system. I then timed a single invocation of the rm(1) command to remove all 64 of these files. Unlinking a file in a Linux file system is a metadata operation; however, when the file system is mounted with the discard option, each unlink also sends TRIM operations to storage. The following screen shot of the XtremIO performance dashboard was taken while the rm(1) command was running. The discard mount option turns a metadata operation into a rather costly storage operation.
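Written out as a reusable shell function, the copy step looks like this. This is a sketch: the pcopy name and argument handling are mine, not from the test script itself, and extra arguments are simply passed through to dd.

```shell
# pcopy: copy a source file N times into the current directory, in
# parallel, mirroring the parallel-dd copy script used in the test.
# Extra arguments are passed to dd (e.g. oflag=direct for direct I/O).
pcopy() {
  src=$1
  count=$2
  shift 2
  i=1
  while [ "$i" -le "$count" ]; do
    dd if="$src" of="file$i" bs=1M "$@" 2>/dev/null &
    i=$((i + 1))
  done
  wait
}

# The invocation equivalent to the test in this post:
#   cd /mnt && pcopy /data1/tape 64 oflag=direct
```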
The following terminal output shows the test step sequence used to test the discard mount option:
# umount /mnt ; mkfs -t ext4 -E nodiscard /dev/xtremio/fs/test; mount -t ext4 -o discard /dev/xtremio/fs/test /mnt
mke2fs 1.43-WIP (20-Jun-2013)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=2 blocks, Stripe width=16 blocks
335544320 inodes, 2684354560 blocks
134217728 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
81920 block groups
32768 blocks per group, 32768 fragments per group
4096 inodes per group
Superblock backups stored on blocks:
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632,
	2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616,
	78675968, 102400000, 214990848, 512000000, 550731776, 644972544,
	1934917632, 2560000000

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done
# cd mnt
# cat > cpit
for i in {1..64}; do
( dd if=/data1/tape of=file$i bs=1M oflag=direct )&
done
wait
# time sh ./cpit > /dev/null 2>&1

real	5m31.530s
user	0m2.906s
sys	8m45.292s
# du -sh .
1018G	.
# time rm -f file*

real	4m52.608s
user	0m0.000s
sys	0m0.497s
#
The following terminal output shows the same test repeated with the file system being mounted with the default (thus no discard) mount options:
# cd ..
# umount /mnt ; mkfs -t ext4 -E nodiscard /dev/xtremio/fs/test; mount -t ext4 /dev/xtremio/fs/test /mnt
mke2fs 1.43-WIP (20-Jun-2013)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=2 blocks, Stripe width=16 blocks
335544320 inodes, 2684354560 blocks
134217728 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
81920 block groups
32768 blocks per group, 32768 fragments per group
4096 inodes per group
Superblock backups stored on blocks:
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632,
	2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616,
	78675968, 102400000, 214990848, 512000000, 550731776, 644972544,
	1934917632, 2560000000

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done
# cd mnt
# cat > cpit
for i in {1..64}; do
( dd if=/data1/tape of=file$i bs=1M oflag=direct )&
done
wait
#
# time sh ./cpit > /dev/null 2>&1

real	5m31.526s
user	0m2.957s
sys	8m50.317s
# time rm -f file*

real	0m16.398s
user	0m0.001s
sys	0m0.750s
#
This testing shows that mounting an Ext4 file system with the discard mount option dramatically impacts file removal operations. With the default mount options (thus no discard), the rm(1) command completed in 16 seconds, whereas the same test took 292 seconds when the file system was mounted with the discard option.
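A quick way to confirm which behavior a given mount will exhibit is to check /proc/mounts for the discard option. Here is a small sketch; the helper name is mine, and the second argument exists only so the helper can be tested against a canned mounts table rather than the live one.

```shell
# mounted_with_discard: succeed if the given mount point is listed with
# the discard option in a mounts table (defaults to /proc/mounts).
mounted_with_discard() {
  mnt=$1
  table=${2:-/proc/mounts}
  awk -v m="$mnt" '
    $2 == m && $4 ~ /(^|,)discard(,|$)/ { found = 1 }
    END { exit !found }
  ' "$table"
}

# Ext4 also accepts discard/nodiscard on remount, so the option can be
# toggled on a live file system without unmounting:
#   mount -o remount,discard /dev/xtremio/fs/test /mnt
#   mount -o remount,nodiscard /dev/xtremio/fs/test /mnt
```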
So how can one perform the important house-cleaning that comes with TRIM operations?
The fstrim(8) Command
Ext4 supports user-invoked, online TRIM operations on mounted file systems. I would advise people to forego the discard mount option and opt for occasionally running the fstrim(8) command. The following is an example of how long it takes to execute fstrim on the same 10TB file system stored on the EMC XtremIO array. I think foregoing the taxation of commands like rm(1) is a good trade, especially since fstrim can be run on a mounted file system and takes only roughly 11 minutes on this 10TB file system.
# time fstrim -v /mnt
/mnt: 10908310835200 bytes were trimmed

real	11m29.325s
user	0m0.000s
sys	2m31.370s
#
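If running fstrim by hand is too easy to forget, it can be scheduled. The fragment below is a sketch: the cadence and log path are my choices, not a recommendation from this testing, and on systemd-based distributions util-linux also ships an fstrim.timer unit that can be enabled instead of using cron.

```shell
# Example crontab entry: TRIM /mnt every Sunday at 03:00 and keep a log.
0 3 * * 0  /usr/sbin/fstrim -v /mnt >> /var/log/fstrim.log 2>&1

# systemd alternative (runs fstrim on all eligible file systems):
#   systemctl enable --now fstrim.timer
```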
Summary
If you use thinly-provisioned storage and want file deletion in Ext4 to return space to the array, you have a choice: take serious performance hits when you create the file system (the default mkfs(8) options) and when you delete files (the optional discard mount(8) option), or occasionally execute the fstrim(8) command on the mounted file system.
Up Next
The next post in this series will focus on XFS.
Are you offering this as universal advice or is it limited to Ext4 and/or XtremIO?
“I would advise people to forego the discard mount option and opt for occasionally running the fstrim(8) command.”
Hi Mark,
That is my recommendation because I have not found a use case where taxing rm(1) so badly makes more sense than the occasional interactive fstrim(8). Do you have a counter-use case? If so, please share.
FWIW, the XFS installment is coming up too…
I have no experience with fstrim, only with discard as a mount option. Just browsed the Linux man page for fstrim and I don’t think their advice about running fstrim once per week applies to my workloads.
We depend on discard. There have been a few cases where a few collections of servers ran without discard and suffered both from excessive wear and perf stalls because write-amp was much larger than needed. This was fixed by mounting with discard. Using fstrim wasn’t considered, so I can’t compare it and I realize this discussion is about fstrim vs discard. So far I am making the case for TRIM whether that is via discard or fstrim.
The original workload I cared about was InnoDB and unlink was infrequent, so unlink performance wasn’t a priority. I have been spending more time with LSMs, mostly RocksDB, over the past few years and that unlinks files all of the time but in background threads so unlink perf hasn’t been a priority.
From test databases I am using right now with XFS + discard and FusionIO. I timed ‘rm -rf’, but didn’t count the number of database files prior to this. The directories were from different products so the number of files varied greatly:
* 683gb -> 28 seconds
* 518gb -> 40 seconds
* 2tb -> 68 seconds
* 514gb -> 30 seconds
Hi Mark,
Thanks for stopping by. I do suppose the main issue is whether a) folks find themselves routinely creating file systems or b) have an application that is sensitive to file deletions.
Just out of curiosity, can you say a word about your XFS file system where you deleted 2TB of files? That must have been LVM stitching together a few FusionIO cards, no?
FusionIO ioScale with more than 3T after formatting. No LVM.
Ah! I was unaware they had such large capacity cards. Do they still put FTL burden on the host?
yes
I’m curious what you guys think of the dilemma we find ourselves in with XFS, EXT4, fstrim, and discard using MongoDB with the WiredTiger storage engine on SSDs:
https://groups.google.com/forum/#!topic/mongodb-user/Mj0x6m-02Ms/discussion
In particular, Mark, I’d be curious what you think of running XFS with discard given the comments we found from XFS developers themselves with dire warnings of corrupt data if used.
Wasn’t willing to wait for your post about XFS. I try to avoid the ext family for database workloads (no concurrent writes, stalls from sync after write-append, etc). I am testing MongoDB with different storage engines — WiredTiger, RocksDB, mmapv1 and TokuMX. The server has SW RAID 0 over 15 disks and XFS with 2MB stripes.

At test end I do ‘rm -r data; mkdir data’ and in a few cases that has been incredibly slow, but only for the mmapv1 database. Now, mmapv1 gets a larger database because it doesn’t use compression, gets bad b-tree fragmentation and does “power-of-2” space allocation. So the database is ~1.5T versus ~500gb on the other hosts.

Via top I see lots of CPU for the rm command and via perf I see the top N sources of CPU. This is XFS fragmentation. With mmapv1, MongoDB stores the database across files of size 2G. From past experience, MongoDB+mmapv1 uses posix_fallocate to make the database files 2G before using them — http://smalldatum.blogspot.com/2014/03/redo-logs-in-mongodb-and-innodb.html
6.88% rm [kernel.kallsyms] [k] _xfs_buf_find
5.35% rm [kernel.kallsyms] [k] memcpy
4.56% rm [kernel.kallsyms] [k] xfs_log_commit_cil
3.56% rm [kernel.kallsyms] [k] xfs_trans_buf_item_match
3.31% rm [kernel.kallsyms] [k] memset
3.26% rm [kernel.kallsyms] [k] xfs_btree_lookup
2.81% rm [kernel.kallsyms] [k] xfs_extent_busy_insert
2.62% rm [kernel.kallsyms] [k] xfs_next_bit
2.28% rm [kernel.kallsyms] [k] kmem_cache_alloc
Thanks for that detail, Mark. However, I can’t tell whether this feedback is additive or meant to refute any of the findings I’m sharing.
I think it is orthogonal — just one more thing to worry about when doing storage-intensive work with XFS. Fragmentation happens in many places — b-trees, SSDs, filesystems, memory allocators.
Thanks, Mark.