I’ve been doing some testing that requires rather large file systems. I have an EMC XtremIO Dual X-Brick array from which I provision a 10 terabyte volume. Volumes in XtremIO are always thinly provisioned. The testing I’m doing required me to scrutinize default Linux mkfs(8) behavior for both Ext4 and XFS. This is part 1 in a short series and it is about Ext4.
Discard the Discard Option
The first thing I noticed in this testing was the fantastical “throughput” demonstrated at the array while running the mkfs(8) command with the “-t ext4” option/arg pair. As the following screen shot shows, the “throughput” at the array level was just shy of 72GB/s.
That’s not real I/O…I’ll explain…
The default options for Ext4 include the discard (TRIM under the covers) option. The mke2fs(8) manpage has this to say about the discard option:
Attempt to discard blocks at mkfs time (discarding blocks initially is useful on solid state devices and sparse / thin-provisioned storage). When the device advertises that discard also zeroes data (any subsequent read after the discard and before write returns zero), then mark all not-yet-zeroed inode tables as zeroed. This significantly speeds up filesystem initialization. This is set as default.
I’ve read that quoted text at least eleventeen times but the wording still sounds like gibberish-scented gobbledygook to me–well, except for the bit about significantly speeding up filesystem initialization.
Since XtremIO volumes are created thin, I don’t see any reason for mkfs to take action to make them, what, thinner? Please let me share test results challenging the assertion that the discard mkfs option results in faster file system initialization. This is the default functionality, after all.
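Before trusting mkfs to do anything useful with discard, it is worth confirming that the device advertises discard support at all. The following is a minimal sketch: the helper name and the fake device names are mine, the sysfs paths are the standard Linux locations, and the second argument exists only so the helper can be exercised against a fake tree.

```shell
# supports_discard: succeed if a whole-disk device advertises a nonzero
# discard granularity in sysfs; fail if the attribute is zero or absent.
supports_discard() {
  dev=$1
  sys=${2:-/sys}
  gran=$(cat "$sys/block/$dev/queue/discard_granularity" 2>/dev/null) || return 1
  [ "${gran:-0}" -gt 0 ]
}

# On a live system one can also simply run:
#   lsblk --discard /dev/xtremio/fs/test
# and look for nonzero DISC-GRAN / DISC-MAX columns.
```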
In the following terminal output you’ll see that the default mkfs options take 152 seconds to make a file system on a freshly-created 10TB XtremIO volume:
# time mkfs -t ext4 /dev/xtremio/fs/test
mke2fs 1.43-WIP (20-Jun-2013)
Discarding device blocks: done
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=2 blocks, Stripe width=16 blocks
335544320 inodes, 2684354560 blocks
134217728 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
81920 block groups
32768 blocks per group, 32768 fragments per group
4096 inodes per group
Superblock backups stored on blocks:
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632,
	2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616,
	78675968, 102400000, 214990848, 512000000, 550731776, 644972544,
	1934917632, 2560000000

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

real	2m32.055s
user	0m3.648s
sys	0m17.280s
#
The mkfs(8) Command Without Default Discard Functionality
Please bear in mind that the default 152-second result is not due to languishing on pathetic physical I/O. The storage is fast. Please consider the following terminal output, where I passed in the non-default -E option with the nodiscard argument. The file system creation took 4.8 seconds:
# time mkfs -t ext4 -E nodiscard /dev/xtremio/fs/test
mke2fs 1.43-WIP (20-Jun-2013)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=2 blocks, Stripe width=16 blocks
335544320 inodes, 2684354560 blocks
134217728 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
81920 block groups
32768 blocks per group, 32768 fragments per group
4096 inodes per group
Superblock backups stored on blocks:
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632,
	2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616,
	78675968, 102400000, 214990848, 512000000, 550731776, 644972544,
	1934917632, 2560000000

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

real	0m4.856s
user	0m4.264s
sys	0m0.415s
#
I think going from 152 seconds down to 4.8 makes the point that, with properly thin-provisioned storage, the mkfs discard option does not “significantly speed up filesystem initialization.” But initializing file systems is not something one does frequently, so an investigation into the discard mount(8) option was in order.
Taking Ext4 For A Drive
Since I had this 10TB Ext4 file system–and a fresh focus on file system discard (storage TRIM) features–I thought I’d take it for a drive.
Discarded the Default Discard But Added The Non-Default Discard
While the default mkfs(8) command includes discard, the mount(8) command does not. I decided to investigate this option while unlinking a reasonable number of large files. To do so I ran a simple script (shown below) that copies 64 files of 16 gigabytes each, in parallel, into the Ext4 file system. I then timed a single invocation of the rm(1) command to remove all 64 of these files. Unlinking a file in a Linux file system is a metadata operation; however, when the file system is mounted with the discard option, each unlink also sends TRIM operations to storage. The following screen shot of the XtremIO performance dashboard was taken while the rm(1) command was running. The discard mount option turns a metadata operation into a rather costly storage operation.
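Written out as a reusable shell function, the copy step looks like this. This is a sketch: the pcopy name and argument handling are mine, not from the test script itself, and extra arguments are simply passed through to dd.

```shell
# pcopy: copy a source file N times into the current directory, in
# parallel, mirroring the parallel-dd copy script used in the test.
# Extra arguments are passed to dd (e.g. oflag=direct for direct I/O).
pcopy() {
  src=$1
  count=$2
  shift 2
  i=1
  while [ "$i" -le "$count" ]; do
    dd if="$src" of="file$i" bs=1M "$@" 2>/dev/null &
    i=$((i + 1))
  done
  wait
}

# The invocation equivalent to the test in this post:
#   cd /mnt && pcopy /data1/tape 64 oflag=direct
```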
The following terminal output shows the test step sequence used to test the discard mount option:
# umount /mnt ; mkfs -t ext4 -E nodiscard /dev/xtremio/fs/test; mount -t ext4 -o discard /dev/xtremio/fs/test /mnt
mke2fs 1.43-WIP (20-Jun-2013)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=2 blocks, Stripe width=16 blocks
335544320 inodes, 2684354560 blocks
134217728 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
81920 block groups
32768 blocks per group, 32768 fragments per group
4096 inodes per group
Superblock backups stored on blocks:
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632,
	2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616,
	78675968, 102400000, 214990848, 512000000, 550731776, 644972544,
	1934917632, 2560000000

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done
# cd mnt
# cat > cpit
for i in {1..64}; do
( dd if=/data1/tape of=file$i bs=1M oflag=direct )&
done
wait
# time sh ./cpit > /dev/null 2>&1

real	5m31.530s
user	0m2.906s
sys	8m45.292s
# du -sh .
1018G	.
# time rm -f file*

real	4m52.608s
user	0m0.000s
sys	0m0.497s
#
The following terminal output shows the same test repeated with the file system being mounted with the default (thus no discard) mount options:
# cd ..
# umount /mnt ; mkfs -t ext4 -E nodiscard /dev/xtremio/fs/test; mount -t ext4 /dev/xtremio/fs/test /mnt
mke2fs 1.43-WIP (20-Jun-2013)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=2 blocks, Stripe width=16 blocks
335544320 inodes, 2684354560 blocks
134217728 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
81920 block groups
32768 blocks per group, 32768 fragments per group
4096 inodes per group
Superblock backups stored on blocks:
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632,
	2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616,
	78675968, 102400000, 214990848, 512000000, 550731776, 644972544,
	1934917632, 2560000000

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done
# cd mnt
# cat > cpit
for i in {1..64}; do
( dd if=/data1/tape of=file$i bs=1M oflag=direct )&
done
wait
#
# time sh ./cpit > /dev/null 2>&1

real	5m31.526s
user	0m2.957s
sys	8m50.317s
# time rm -f file*

real	0m16.398s
user	0m0.001s
sys	0m0.750s
#
This testing shows that mounting an Ext4 file system with the discard mount option dramatically impacts file removal operations. With the default mount options (thus no discard), the rm(1) command completed in 16 seconds, whereas the same test took 292 seconds when the file system was mounted with the discard option.
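A quick way to confirm which behavior a given mount will exhibit is to check /proc/mounts for the discard option. Here is a small sketch; the helper name is mine, and the second argument exists only so the helper can be tested against a canned mounts table rather than the live one.

```shell
# mounted_with_discard: succeed if the given mount point is listed with
# the discard option in a mounts table (defaults to /proc/mounts).
mounted_with_discard() {
  mnt=$1
  table=${2:-/proc/mounts}
  awk -v m="$mnt" '
    $2 == m && $4 ~ /(^|,)discard(,|$)/ { found = 1 }
    END { exit !found }
  ' "$table"
}

# Ext4 also accepts discard/nodiscard on remount, so the option can be
# toggled on a live file system without unmounting:
#   mount -o remount,discard /dev/xtremio/fs/test /mnt
#   mount -o remount,nodiscard /dev/xtremio/fs/test /mnt
```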
So how can one perform the important house-cleaning that comes with TRIM operations?
The fstrim(8) Command
Ext4 supports user-invoked, online TRIM operations on mounted file systems. I would advise people to forego the discard mount option and opt for occasionally running the fstrim(8) command. The following is an example of how long it takes to execute fstrim on the same 10TB file system stored on the EMC XtremIO array. I think foregoing the taxation of commands like rm(1) is a good trade, especially since fstrim can be run on a mounted file system and takes only roughly 11 minutes on this 10TB file system.
# time fstrim -v /mnt
/mnt: 10908310835200 bytes were trimmed

real	11m29.325s
user	0m0.000s
sys	2m31.370s
#
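If running fstrim by hand is too easy to forget, it can be scheduled. The fragment below is a sketch: the cadence and log path are my choices, not a recommendation from this testing, and on systemd-based distributions util-linux also ships an fstrim.timer unit that can be enabled instead of using cron.

```shell
# Example crontab entry: TRIM /mnt every Sunday at 03:00 and keep a log.
0 3 * * 0  /usr/sbin/fstrim -v /mnt >> /var/log/fstrim.log 2>&1

# systemd alternative (runs fstrim on all eligible file systems):
#   systemctl enable --now fstrim.timer
```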
Summary
If you use thinly-provisioned storage and want file deletion in Ext4 to return space to the array, you have a choice: take serious performance hits when you create the file system (the default mkfs(8) options) and when you delete files (the optional discard mount(8) option), or occasionally execute the fstrim(8) command on the mounted file system.
Up Next
The next post in this series will focus on XFS.
Are you offering this as universal advice or is it limited to Ext4 and/or XtremIO?
“I would advise people to forego the discard mount option and opt for occasionally running the fstrim(8) command.”
Hi Mark,
That is my recommendation because I have not found a use case where taxing rm(1) so badly makes more sense than the occasional interactive fstrim(8). Do you have a counter-use case? If so, please share.
FWIW, the XFS installment is coming up too…
I have no experience with fstrim, only with discard as a mount option. Just browsed the Linux man page for fstrim and I don’t think their advice about running fstrim once per week applies to my workloads.
We depend on discard. There have been a few cases where a few collections of servers ran without discard and suffered both from excessive wear and perf stalls because write-amp was much larger than needed. This was fixed by mounting with discard. Using fstrim wasn’t considered, so I can’t compare it and I realize this discussion is about fstrim vs discard. So far I am making the case for TRIM whether that is via discard or fstrim.
The original workload I cared about was InnoDB and unlink was infrequent, so unlink performance wasn’t a priority. I have been spending more time with LSMs, mostly RocksDB, over the past few years and that unlinks files all of the time but in background threads so unlink perf hasn’t been a priority.
From test databases I am using right now with XFS + discard and FusionIO. I timed ‘rm -rf’, but didn’t count the number of database files prior to this. The directories were from different products so the number of files varied greatly:
* 683gb -> 28 seconds
* 518gb -> 40 seconds
* 2tb -> 68 seconds
* 514gb -> 30 seconds
Hi Mark,
Thanks for stopping by. I do suppose the main issue is whether a) folks find themselves routinely creating file systems or b) have an application that is sensitive to file deletions.
Just out of curiosity, can you say a word about your XFS file system where you deleted 2TB of files? That must have been LVM stitching together a few FusionIO cards, no?
FusionIO ioScale with more than 3T after formatting. No LVM.
Ah! I was unaware they had such large capacity cards. Do they still put FTL burden on the host?
yes
I’m curious what you guys think of the dilemma we find ourselves in with XFS, EXT4, fstrim, and discard using MongoDB with the WiredTiger storage engine on SSDs:
https://groups.google.com/forum/#!topic/mongodb-user/Mj0x6m-02Ms/discussion
In particular, Mark, I’d be curious what you think of running XFS with discard given the comments we found from XFS developers themselves with dire warnings of corrupt data if used.
Wasn’t willing to wait for your post about XFS. I try to avoid the ext family for database workloads (no concurrent writes, stalls from sync after write-append, etc). I am testing MongoDB with different storage engines — WiredTiger, RocksDB, mmapv1 and TokuMX. The server has SW RAID 0 over 15 disks and XFS with 2MB stripes.

At test end I do ‘rm -r data; mkdir data’ and in a few cases that has been incredibly slow, but only for the mmapv1 database. Now, mmapv1 gets a larger database because it doesn’t use compression, gets bad b-tree fragmentation and does “power-of-2” space allocation. So the database is ~1.5T versus ~500gb on the other hosts.

Via top I see lots of CPU for the rm command and via perf I see the top N sources of CPU. This is XFS fragmentation. With mmapv1, MongoDB stores the database across files of size 2G. From past experience, MongoDB+mmapv1 uses posix_fallocate to make the database files 2G before using them — http://smalldatum.blogspot.com/2014/03/redo-logs-in-mongodb-and-innodb.html
6.88% rm [kernel.kallsyms] [k] _xfs_buf_find
5.35% rm [kernel.kallsyms] [k] memcpy
4.56% rm [kernel.kallsyms] [k] xfs_log_commit_cil
3.56% rm [kernel.kallsyms] [k] xfs_trans_buf_item_match
3.31% rm [kernel.kallsyms] [k] memset
3.26% rm [kernel.kallsyms] [k] xfs_btree_lookup
2.81% rm [kernel.kallsyms] [k] xfs_extent_busy_insert
2.62% rm [kernel.kallsyms] [k] xfs_next_bit
2.28% rm [kernel.kallsyms] [k] kmem_cache_alloc
Thanks for that detail, Mark. However, I can’t tell whether this feedback is additive or meant to refute any of the findings I’m sharing.
I think it is orthogonal — just one more thing to worry about when doing storage-intensive work with XFS. Fragmentation happens in many places — b-trees, SSDs, filesystems, memory allocators.
Thanks, Mark.