Little Things Doth Crabby Make – Part XVII. I See xfs_mkfile(8) Making Fragmented Files.

BLOG UPDATE 21-NOV-2011: The comment thread for this post is extremely relevant.


I recently had an “exchange of ideas” with an individual. It was this individual’s assertion that modern systems exhibit memory latencies measured in microseconds.

Since I haven’t worked on a system with microsecond memory latency since late in the last millennium, I sort of let the conversation languish.

The topic of systems speeds and feeds was fresh on my mind from that conversation when I encountered something that motivated me to produce this installment in the Little Things Doth Crabby Make series.

This installment in the series has to do with disk scan throughput and file system fragmentation. But what does that have to do with modern systems’ memory latency? Well, I’ll try to explain.

Even though I haven’t had the displeasure of dealing with microsecond memory this century, I do recall that such ancient systems were routinely fed (and swamped) by just a few hundred megabytes per second of disk scan throughput.

I try to keep things like that in perspective when I’m fretting over the loss of 126MB/s like I was the other day. Especially when the 126MB/s is a paltry 13% degradation in the systems I was analyzing! Modern systems are a modern marvel!

But what does any of that have to do with XFS and fragmentation? Please allow me to explain. I had a bit of testing going on where 13% (for 126MB/s) did make me crabby (it’s Little Things Doth Crabby Make after all).

The synopsis of the test, and thus the central topic of this post, was:

  1. Create and initialize a 32GB file whilst the server is otherwise idle
  2. Flush the Linux page cache
  3. Use dd(1) to scan the file with 64KB reads — measure performance
  4. Use xfs_bmap(8) to report on file extent allocation and fragmentation
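The four steps above can be sketched as a small shell script. This is a hypothetical sketch, not the exact script used for the post; the file name and size are examples, and flushing the page cache requires root:

```shell
# Hypothetical sketch of the four-step test; file name and size are
# examples. Needs root and an XFS mount to run for real; with DRYRUN=1
# it only prints the commands instead of executing them.
run() { if [ -n "$DRYRUN" ]; then echo "$@"; else "$@"; fi; }

scan_test() {
  local f=$1
  run xfs_mkfile 32g "$f"                              # 1. create/initialize
  run sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'  # 2. flush page cache
  run dd if="$f" of=/dev/null bs=64k                   # 3. timed 64KB scan
  run xfs_bmap -v "$f"                                 # 4. extent report
}

DRYRUN=1 scan_test testfile   # dry-run: just print the steps
```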

Step number 1 in the test varied the file creation/initialization method between the following three techniques/tools:

  1. xfs_mkfile(8)
  2. dd(1) with 1GB writes (yes, this works if you have sufficient memory)
  3. dd(1) with 64KB writes
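As a sanity check on the dd(1) arguments, 32GB works out to the following counts at each block size used in this test (a quick awk calculation; the 64KB count also matches the record counts reported by the scan):

```shell
# 32GB expressed as a dd count for each block size used in this test.
awk 'BEGIN { gib = 32 * 1024 * 1024 * 1024;
             printf "bs=1024M count=%d\n", gib / (1024 * 1024 * 1024);
             printf "bs=1M    count=%d\n", gib / (1024 * 1024);
             printf "bs=64k   count=%d\n", gib / (64 * 1024) }'
```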

The following screen-scrape shows that the xfs_mkfile(8) case rendered a file that delivered scan performance significantly worse than the two dd(1) cases. The degradation was 13%:

# xfs_mkfile 32g testfile
# sync;sync;sync;echo "3" > /proc/sys/vm/drop_caches
# dd if=testfile of=/dev/null bs=64k
524288+0 records in
524288+0 records out
34359738368 bytes (34 GB) copied, 40.8091 seconds, 842 MB/s
# xfs_bmap -v testfile > frag.xfs_mkfile.out 2>&1
# rm -f testfile
# dd if=/dev/zero of=testfile bs=1024M count=32
32+0 records in
32+0 records out
34359738368 bytes (34 GB) copied, 22.1434 seconds, 1.6 GB/s
# sync;sync;sync;echo "3" > /proc/sys/vm/drop_caches
# dd if=testfile of=/dev/null bs=64k
524288+0 records in
524288+0 records out
34359738368 bytes (34 GB) copied, 35.5057 seconds, 968 MB/s
# xfs_bmap -v testfile > frag.ddLargeWrites.out 2>&1
# rm testfile
# df -h .
Filesystem Size Used Avail Use% Mounted on
/dev/sdb 2.7T 373G 2.4T 14% /data1
# dd if=/dev/zero of=testfile bs=1M count=32678
32678+0 records in
32678+0 records out
34265366528 bytes (34 GB) copied, 21.6339 seconds, 1.6 GB/s
# sync;sync;sync;echo "3" > /proc/sys/vm/drop_caches
# dd if=testfile of=/dev/null bs=64k
522848+0 records in
522848+0 records out
34265366528 bytes (34 GB) copied, 35.3932 seconds, 968 MB/s
# xfs_bmap -v testfile > frag.ddSmallWrites.out 2>&1
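The 13% figure falls out of the two scan rates above: 842 MB/s for the xfs_mkfile(8) file versus 968 MB/s for the dd(1) files. A quick awk check:

```shell
# Degradation of the xfs_mkfile(8)-created file's scan rate vs. dd(1)'s.
awk 'BEGIN { mkfile = 842; dd = 968;        # MB/s, from the runs above
             loss = dd - mkfile;
             printf "loss = %d MB/s (%.0f%%)\n", loss, 100 * loss / dd }'
```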

I was surprised by the xfs_mkfile(8) case. Let’s take a look at the xfs_bmap(8) output.

First, the two maps from the dd(1) files:

# cat frag.ddSmallWrites.out
0: [0..9961471]: 1245119816..1255081287 6 (166187576..176149047) 9961472
1: [9961472..26705919]: 1342791800..1359536247 7 (84037520..100781967) 16744448
2: [26705920..43450367]: 1480316192..1497060639 8 (41739872..58484319) 16744448
3: [43450368..66924543]: 1509826928..1533301103 8 (71250608..94724783) 23474176
# cat frag.ddLargeWrites.out
0: [0..9928703]: 1245119816..1255048519 6 (166187576..176116279) 9928704
1: [9928704..26673151]: 1342791800..1359536247 7 (84037520..100781967) 16744448
2: [26673152..43417599]: 1480316192..1497060639 8 (41739872..58484319) 16744448
3: [43417600..67108863]: 1509826928..1533518191 8 (71250608..94941871) 23691264

The mapping of file offsets to extents is quite close in the two dd(1) cases. Moreover, XFS gave me 4 extents for my 32GB file. I like that, but…

So what about the xfs_mkfile(8) case? Well, not so good.

I’ll post a blog update when I figure out more about what’s going on. In the meantime, I’ll just paste the xfs_bmap(8) output for the xfs_mkfile(8) file, and that will be the end of this post for the time being:

# cat frag.xfs_mkfile.out
0: [0..10239]: 719289592..719299831 4 (1432..11671) 10240
1: [10240..14335]: 719300664..719304759 4 (12504..16599) 4096
2: [14336..46591]: 719329072..719361327 4 (40912..73167) 32256
3: [46592..78847]: 719361840..719394095 4 (73680..105935) 32256
4: [78848..111103]: 719394608..719426863 4 (106448..138703) 32256
5: [111104..143359]: 719427376..719459631 4 (139216..171471) 32256
6: [143360..175615]: 719460144..719492399 4 (171984..204239) 32256
7: [175616..207871]: 719492912..719525167 4 (204752..237007) 32256
8: [207872..240127]: 719525680..719557935 4 (237520..269775) 32256
[...3,964 lines deleted...]
3972: [51041280..51073535]: 1115787376..1115819631 6 (36855136..36887391) 32256
3973: [51073536..51083775]: 1115842464..1115852703 6 (36910224..36920463) 10240
3974: [51083776..51116031]: 1115852912..1115885167 6 (36920672..36952927) 32256
3975: [51116032..54897663]: 1142259368..1146040999 6 (63327128..67108759) 3781632
3976: [54897664..55078911]: 1146077440..1146258687 6 (67145200..67326447) 181248
3977: [55078912..56094207]: 1195607400..1196622695 6 (116675160..117690455) 1015296
3978: [56094208..67108863]: 1245119816..1256134471 6 (166187576..177202231) 11014656
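A quick way to reduce a map like that to a fragmentation summary is to count the extents and their sizes. This is a hypothetical awk sketch; it assumes xfs_bmap(8) -v style extent lines where the last column is the extent length in 512-byte blocks, as in the maps above:

```shell
# Hypothetical sketch: reduce an xfs_bmap -v style map to a fragmentation
# summary. Assumes the last column of each extent line is the extent
# length in 512-byte blocks.
summarize() {
  awk '/^ *[0-9]+:/ { n++; s = $NF
                      if (s > max) max = s
                      if (!min || s < min) min = s }
       END { printf "%d extents, min %d max %d blocks\n", n, min, max }'
}

summarize <<'EOF'
 0: [0..10239]: 719289592..719299831 4 (1432..11671) 10240
 1: [10240..14335]: 719300664..719304759 4 (12504..16599) 4096
 2: [14336..46591]: 719329072..719361327 4 (40912..73167) 32256
EOF
```

Run against the full xfs_mkfile(8) map it would report nearly 4,000 extents, versus 4 for each of the dd(1) files.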

4 Responses to “Little Things Doth Crabby Make – Part XVII. I See xfs_mkfile(8) Making Fragmented Files.”

  1. Ben Prusinski November 18, 2011 at 4:45 pm

    Thanks Kevin,

    Great post. I look forward to benchmarking XFS versus ASM versus ext3/4 on RHEL and Oracle 11gR2.

    Ben Prusinski

    • kevinclosson November 18, 2011 at 6:25 pm

      Hi Ben,

      Take a peek at the mount options we use for XFS with the EMC Greenplum Data Computing Appliance and use the same for your tests. Also, run that xfs_bmap command on all your datafiles before you run your benchmark and let us know how fragmented the files are when they are created with Oracle CCF (create tablespace). Make sure to set filesystemio_options=setall.

      Would this happen to be BIGFILE testing perchance?

  2. Mark Seger November 21, 2011 at 12:45 pm

    a couple of comments/questions…

    Do you use dd because you really like it, or is it just convenient? I was first introduced to disk benchmarking with Robin Miller’s dt (disk test) tool – at least for me it’s far easier and more intuitive to use than dd. None of this record-versus-blocksize counting to figure out how much to read/write: you just give it a file size and a block size and it figures out how many records to read. If you want to repeat a test with different values, you only need to change one. If you want to switch from sequential to random I/O or increase the number of parallel threads, it’s a piece of cake. You can even select asynchronous or direct I/O. It has almost as many options as collectl ;), so you can really fine-tune your testing.

    You talked about doing block sizes of 1GB, but are they really 1GB? Doesn’t it depend on what the disk controller allows? Won’t the O/S break them into more bite-sized chunks – or is that the whole point, and your controller does support them? One thing I’ve found on the flip side is that small block sizes rarely seemed to make a difference, because Linux would just merge them and still do large-block writes. I rarely choose anything below 1M because I’ve never really seen a difference, and 1M is fewer keystrokes than 32k, in dt-speak.

    Generally, whenever I run a test of any kind I run collectl in a window to watch what’s happening. Is the bandwidth relatively smooth? A number of years ago I was doing some measurements creating VERY large xfs directories containing over 1M small files. What I found by running collectl was that at somewhere around 800K or 900K files (it was a long time ago), you could actually see the create rates slow down and the CPU use increase significantly – I think it had something to do with scanning internal tables. The point is, if I had only looked at the wall clock I’d never have spotted that. But then again, perhaps you have done this and didn’t see anything odd.

    btw – the correct number is 32768, not 32678 😉 The reason I know that is that when I first started programming in BasicPlus on RSTS/E, the max line number (yes, it had line numbers!) was 32767.


    • kevinclosson November 21, 2011 at 7:32 pm

      Hi Mark,

      I use dd in this case only to make a point using a tool that is available to everyone – so, yes, convenient. I have all my own dd-ish sort of I/O generators that I use in my day job.

      Your points about large size arguments to dd are valid. By the time the request leaves the Linux block layer, the driver requests are quite small (<256KB), just as is the case with the relationship between the block I/O layer and your (HP) cciss driver. The only way to get 1MB driver requests is to buffer in hugepages IPC shared memory or mmapped files in hugetlbfs. However, all the LSI controllers I've had experience with will accept a 1MB (properly buffered) read but *not* a 1MB write. None of that matters, though, because extent allocation for a growing XFS file happens *way* upstream from the block I/O layer (the entity that chops up large application requests). Those allocations happen in XFS code, and the main ingredients involved are the mount options for XFS and the characteristics of the application's file-growth requests (appending write size, negative truncate, etc.).
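      To put a back-of-envelope number on that splitting: the per-device cap the block layer applies is visible in /sys/block/<dev>/queue/max_sectors_kb, and assuming a typical 256KB cap (an assumption, not a measured value from this system), a single 1GB write from dd(1) fans out into thousands of driver requests:

```shell
# Back-of-envelope: driver requests per 1GB dd write if the block layer
# caps requests at 256KB (an assumed, typical max_sectors_kb value).
awk 'BEGIN { gb_kb = 1024 * 1024; cap_kb = 256;
             printf "%d driver requests per 1GB write\n", gb_kb / cap_kb }'
```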

      If you look at the example I've supplied, you'll see a dd(1) measurement derived from a file created with a 1GB size arg and another with 64KB. The reason I do the 1GB thing is so I don't typo like I did in the 2^15 case (it was a typo, not bad math, but thanks for the walk down memory lane 🙂 ). The difference in the file sizes is .3%, and you'll notice that Linux dd(1) produces its measured I/O rate in MB/s. Both cases created with dd(1) rendered less fragmentation (the important point of this post) and correspondingly more throughput than the xfs_mkfile(8) case (the slightly less important point of this post).

      If any of that sounds like mumbo-jumbo let me know. We can either take it up here or in one of our regular, on-going email threads about collectl.

      BTW, readers, Mark brings up collectl. If you haven't looked at collectl, um, you should! 🙂



I work for Amazon Web Services, but all of the words on this blog are purely my own. Nothing here should be mistaken for official Amazon messaging; this is not an official Amazon information outlet, and I am not an Amazon spokesperson.

All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.
