Focusing on Ext4 and XFS TRIM Operations

I’ve been doing some testing that requires rather large file systems. I have an EMC XtremIO Dual X-Brick array from which I provision a 10 terabyte volume. Volumes in XtremIO are always thinly provisioned. The testing I’m doing required me to scrutinize default Linux mkfs(8) behavior for both Ext4 and XFS. This is part 1 in a short series and it is about Ext4.

Discard the Discard Option

The first thing I noticed in this testing was the fantastical “throughput” demonstrated at the array while running the mkfs(8) command with the “-t ext4” option/arg pair. As the following screen shot shows the “throughput” at the array level was just shy of 72GB/s.

That’s not real I/O…I’ll explain…

EMC XtremIO Dual X-Brick Array During Ext4 mkfs(8). Default Options.

EMC XtremIO Dual X-Brick Array During Ext4 mkfs(8). Default Options.

The default options for Ext4 include the discard (TRIM under the covers) option. The mkfs(8) manpage has this to say about the discard option :

Attempt to discard blocks at mkfs time (discarding blocks initially is useful on solid state devices and sparse / thin-provisioned storage). When the device advertises that discard also zeroes data (any subsequent read after the discard and before write returns zero), then mark all not-yet-zeroed inode tables as zeroed. This significantly speeds up filesystem initialization. This is set as default.

I’ve read that quoted text at least eleventeen times but the wording still sounds like gibberish-scented gobbledygook to me–well, except for the bit about significantly speeding up filesystem initialization.

Since XtremIO volumes are created thin I don’t see any reason for mkfs to take action to make it, what, thinner?  Please let me share test results challenging the assertion that the discard mkfs option results in faster file system initialization. This is the default functionality after all.

In the following terminal output you’ll see that the default mkfs options take 152 seconds to make a file system on a freshly-created 10TB XtremIO volume:


# time mkfs -t ext4 /dev/xtremio/fs/test
mke2fs 1.43-WIP (20-Jun-2013)
Discarding device blocks: done
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=2 blocks, Stripe width=16 blocks
335544320 inodes, 2684354560 blocks
134217728 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
81920 block groups
32768 blocks per group, 32768 fragments per group
4096 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
2560000000

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done
real 2m32.055s
user 0m3.648s
sys 0m17.280s
#

The mkfs(8) Command Without Default Discard Functionality

Please bear in mind that default 152 second result is not due to languishing on pathetic physical I/O. The storage is fast. Please consider the following terminal output where I passed in the non-default -E option with the nodiscard argument. The file system creation took 4.8 seconds:

# time mkfs -t ext4 -E nodiscard /dev/xtremio/fs/test
mke2fs 1.43-WIP (20-Jun-2013)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=2 blocks, Stripe width=16 blocks
335544320 inodes, 2684354560 blocks
134217728 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
81920 block groups
32768 blocks per group, 32768 fragments per group
4096 inodes per group
Superblock backups stored on blocks:
 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
 102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
 2560000000

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done 

real 0m4.856s
user 0m4.264s
sys 0m0.415s
#

I think 152 seconds down to 4.8 makes the point that with proper, thinly-provisioned storage the mkfs discard option does not “significantly speed up filesystem initialization.” But initializing file systems is not something one does frequently so investigation into the discard mount(8) option was in order.

Taking Ext4 For A Drive

Since I had this 10TB Ext4 file system–and a fresh focus on file system discard (storage TRIM) features–I thought I’d take it for a drive.

Discarded the Default Discard But Added The Non-Default Discard

While the default mkfs(8) command includes discard, the mount(8) command does not. I decided to investigate this option while unlinking a reasonable number of large files. To do so I ran a simple script (shown below) that copies 64 files of 16 gigabytes each–in parallel–into the Ext4 file system. I then timed a single invocation of the rm(1) command to remove all 64 of these files. Unlinking file in a Linux file system is a metadata operation, however, when the discard option is used to mount the file system each unlink operation includes TRIM operations being sent to storage. The following screen shot of the XtremIO performance dashboard was taken while the rm(1) command was running. The discard mount option turns a metadata operation into a rather costly storage operation.

Array Level Activity During Bulk rm(1) Command Processing. Ext4 (discard mount option)

Array Level Activity During Bulk rm(1) Command Processing. Ext4 (discard mount option)

The following terminal output shows the test step sequence used to test the discard mount option:

# umount /mnt ; mkfs -t ext4 -E nodiscard /dev/xtremio/fs/test; mount -t ext4 -o discard /dev/xtremio/fs/test /mnt
mke2fs 1.43-WIP (20-Jun-2013)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=2 blocks, Stripe width=16 blocks
335544320 inodes, 2684354560 blocks
134217728 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
81920 block groups
32768 blocks per group, 32768 fragments per group
4096 inodes per group
Superblock backups stored on blocks:
 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
 102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
 2560000000

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done 

# cd mnt
# cat > cpit
for i in {1..64}; do ( dd if=/data1/tape of=file$i bs=1M oflag=direct )& done
wait
# time sh ./cpit > /dev/null 2>&1 

real 5m31.530s
user 0m2.906s
sys 8m45.292s
# du -sh .
1018G .
# time rm -f file*

real 4m52.608s
user 0m0.000s
sys 0m0.497s
#

The following terminal output shows the same test repeated with the file system being mounted with the default (thus no discard) mount options:

# cd ..
# umount /mnt ; mkfs -t ext4 -E nodiscard /dev/xtremio/fs/test; mount -t ext4 /dev/xtremio/fs/test /mnt
mke2fs 1.43-WIP (20-Jun-2013)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=2 blocks, Stripe width=16 blocks
335544320 inodes, 2684354560 blocks
134217728 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
81920 block groups
32768 blocks per group, 32768 fragments per group
4096 inodes per group
Superblock backups stored on blocks:
 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
 102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
 2560000000

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done 

# cd mnt
# cat > cpit
for i in {1..64}; do ( dd if=/data1/tape of=file$i bs=1M oflag=direct )& done
wait
#
# time sh ./cpit > /dev/null 2>&1 

real 5m31.526s
user 0m2.957s
sys 8m50.317s
# time rm -f file*

real 0m16.398s
user 0m0.001s
sys 0m0.750s
#

This testing shows that mounting an Ext4 file system with the discard mount option dramatically impacts file removal operations. The default mount options (thus no discard option) performed the rm(1) command in 16 seconds whereas the same test took 292 seconds when mounted with the discard mount option.

So how can one perform the important house-cleaning that comes with TRIM operations?

The fstrim(8) Command

Ext4 supports user-invoked, online TRIM operations on mounted file systems. I would advise people to forego the discard mount option and opt for occasionally running the fstrim(8) command. The following is an example of  how long it takes to execute fstrim on the same 10TB file system stored in an EMC XtremIO array. I think that foregoing the taxation of commands like rm(1) is a good thing–especially since running fstrim is allowed on mounted file systems and only takes roughly 11 minutes on a 10TB file system.

# time fstrim -v /mnt
/mnt: 10908310835200 bytes were trimmed

real 11m29.325s
user 0m0.000s
sys 2m31.370s
#

Summary

If you use thinly-provisioned storage and want file deletion in Ext4 to return space to the array you have a choice. You can choose to take serious performance hits when you create the file system (default mkfs(8) options) and when you delete files (optional discard mount(8) option) or you can occasionally execute the fstrim(8) command on a mounted file system.

Up Next

The next post in this series will focus on XFS.

12 Responses to “Focusing on Ext4 and XFS TRIM Operations”


  1. 1 markcallaghan (@markcallaghan) July 19, 2015 at 2:50 pm

    Are you offering this as universal advice or is it limited to Ext4 and/or XtremIO?

    “I would advise people to forego the discard mount option and opt for occasionally running the fstrim(8) command.”

    • 2 kevinclosson July 19, 2015 at 3:06 pm

      Hi Mark,

      That is my recommendation because I have not found a use case where taxing rm(1) so badly makes more sense then the occasional interactive fstrim(8). Do you have a counter-use case? If so please share.

      FWIW, the XFS installment is coming up too…

      • 3 markcallaghan (@markcallaghan) July 19, 2015 at 7:02 pm

        I have no experience with fstrim, only with discard as a mount option. Just browsed the Linux man page for fstrim and I don’t think their advice about running fstrim once per week applies to my workloads.

        We depend on discard. There have been a few cases where a few collections of servers ran without discard and suffered both from excessive wear and perf stalls because write-amp was much larger than needed. This was fixed by mounting with discard. Using fstrim wasn’t considered, so I can’t compare it and I realize this discussion is about fstrim vs discard. So far I am making the case for TRIM whether that is via discard or fstrim.

        The original workload I cared about was InnoDB and unlink was infrequent, so unlink performance wasn’t a priority. I have been spending more time with LSMs, mostly RocksDB, over the past few years and that unlinks files all of the time but in background threads so unlink perf hasn’t been a priority.

        From test databases I am using right now with XFS + discard and FusionIO. I timed ‘rm -rf’, but didn’t count the number of database files prior to this. The directories were from different products so the number of files varied greatly:
        * 683gb -> 28 seconds
        * 518gb -> 40 seconds
        * 2tb -> 68 seconds
        * 514gb -> 30 seconds

  2. 9 markcallaghan (@markcallaghan) July 23, 2015 at 7:17 am

    Wasn’t willing to wait for your post about XFS. I try to avoid the ext family for database workloads (no concurrent writes, stalls from sync after write-append, etc). I am testing MongoDB with different storage engines — WiredTiger, RocksDB, mmapv1 and TokuMX. The server has SW RAID 0 over 15 disks and XFS with 2MB stripes. At test end I do ‘rm -r data; mkdir data’ and in a few cases that has been incredibly slow but only for mmapv1 database. Now, mmapv1 gets a larger database because it doesn’t use compression, gets bad b-tree fragmentation and does “power-of-2” space allocation. So the database is ~1.5T versus ~500gb on the other hosts. Via top I see lots of CPU for the rm command and via perf I see the top N sources of CPU. This is XFS fragmentation. With mmapv1, MongoDB stores the database across files of size 2G. From past experience, MongoDB+mmapv1 uses posix_fallocate to make the database files 2G before using them — http://smalldatum.blogspot.com/2014/03/redo-logs-in-mongodb-and-innodb.html

    6.88% rm [kernel.kallsyms] [k] _xfs_buf_find
    5.35% rm [kernel.kallsyms] [k] memcpy
    4.56% rm [kernel.kallsyms] [k] xfs_log_commit_cil
    3.56% rm [kernel.kallsyms] [k] xfs_trans_buf_item_match
    3.31% rm [kernel.kallsyms] [k] memset
    3.26% rm [kernel.kallsyms] [k] xfs_btree_lookup
    2.81% rm [kernel.kallsyms] [k] xfs_extent_busy_insert
    2.62% rm [kernel.kallsyms] [k] xfs_next_bit
    2.28% rm [kernel.kallsyms] [k] kmem_cache_alloc


Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.




DISCLAIMER

I work for Amazon Web Services. The opinions I share in this blog are my own. I'm *not* communicating as a spokesperson for Amazon. In other words, I work at Amazon, but this is my own opinion.

Enter your email address to follow this blog and receive notifications of new posts by email.

Join 744 other subscribers
Oracle ACE Program Status

Click It

website metrics

Fond Memories

Copyright

All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.

%d bloggers like this: