Archive for the 'Solaris I/O Performance' Category

Copying Files on Solaris. Slow or Fast, It’s Your Choice. Part III

This is the third installment on this thread. For context, please see:
Copying Files on Solaris. Slow or Fast, It’s Your Choice. Part I
Copying Files on Solaris. Slow or Fast, It’s Your Choice. Part II

What About cp8M Versus Stock cp(1) with Non-forcedirectio?
That is a good question. The saga continues after my post about copying files on Solaris. Once again, Padraig O’Sullivan was kind enough to test cp8M (available here ) versus stock cp(1) using a normal mounted UFS (non-forcedirectio). He reports:

Ok, I ran the benchmark in the same manner as before WITHOUT forcedirectio i.e. I rebooted the machine before each copy of the file.

# time /usr/bin/cp large_file large_file.1
real 2m17.504s
user 0m0.002s
sys 0m14.363s
#
# time /usr/bin/cp8m large_file large_file.
real 1m56.217s
user 0m0.003s
sys 0m14.264s
#

Why?
I don’t know. I certainly did not expect an increase in kernel mode cycles for the mmap-enabled cp(1). Please refer to Part I in this series to see that the comparison here is between 14.363s versus 10.853s of kernel-mode cycles.  We’re not talking a little increase. No, what Padraig’s measurements show is an increase of 32% in kernel mode cycles when copying a 1000MB file using stock cp(1) on a regular mount compared to the same work on a file in a forcedirectio mount. But hey, at least the performance (in MB/s) was consistently 16% less than cp8M. Yes, that was sarcasm.

I haven’t yet gotten my head around why the standard mmap-enabled cp(1) suffers such a jump in kernel mode processor overhead when switching from direct I/O to a normal UFS mount. I need to think about that a bit.

As usual, a picture speaks a thousand words, so I’ll provide two:

cp8m1.jpg

cp8m2.jpg

Remember my rant about the “small test?”

Sharing, and Caring
There was a comment by a reader on Part I of this blog thread that is worthy of discussion. The reader commented:

Perhaps a fairly obvious statement this, but notice the use of MAP_SHARED on the mmap call? – (I suspect you’ve spotted that already). This means that multiple processes can attach to the same memory mapped file simultaneously.

That is a good blog comment and evidence of someone giving it some thought. But, I’d like to comment on the sharable aspect of the 8MB map the reader brought out. I replied:

Your point about the kernel bcopy from UFS read buffers to the heap buffer in the address space of the cp(1) process is a good one, but this is a forcedirectio case.

That means there is no copy from the page cache into the virtual address space of process since it is direct I/O. My reply continued:

Since this is an Oracle blog, I would naturally go with the forcedirectio comparison first. It will be interesting to see with a normal UFS mount.

I’ve got a $2 bet that the MAP_SHARED is only there to facilitate copying an already mmapped file…the odds of a process jumping in and sharing a 8MB map that only lives for the duration of an I/O in and an I/O out seems pretty slim to me…but then that is 8MB twice…hmmm…I guess that 8MB mmap could exist for as much as 2-3 seconds if the I/O is headed for a single, simple drive. Sounds like a race just to share an 8MB map to me.

A Closer Look
Yes, when the stock cp(1) mmaps each 8MB segment of the input file it does so with MAP_SHARED. Like I said, I suspect the only thought behind that flag usage was to ensure there wouldn’t be “twinkling” mmap failures by other processes that could potentially be mmapping parts of that file while cp(1) is walking through it.

The reader’s comment continued:

That’s not to say that they all need to be “cp”’s – anything using mmap() on the same file at about same time will yield a benefit – the 8MB chunk paged in by mmap should only be later reclaimed by the pagescanner (or when the last process detaches?).

I already discussed the odds of another process getting in there and benefiting by that very transient mmap. It is 8MB in size and only valid during the read in, and write out—about 2-3 seconds on a really slow disk subsystem.

Reclaims
What’s this about reclaims? Good topic. When the mmap is dissolved through munmap(), the pages of the file are put on the free list (pagecache). Here is where the non-forcedirectio cp8M and cp(1) have a lot in common. In both cases, the blocks from the input file remain in main memory. Now that is where there is some true opportunity for sharing but only in the non-forcedirectio case. All said, it doesn’t take mmap() to get sharing of file contents being copied when you are using UFS with a normal mount.

So the question remains, what’s up with the mmap()-enabled cp(1)?

Is this thread making you sleepy?

Copying Files on Solaris. Slow or Fast, It’s Your Choice. Part II

 

In my post about copying files on Solaris, I cover a modification that to GNU cp(1) that yields substantial performance improvements over the stock cp(1) when copying files on forcedirectio mounts. See my comments at the end of that post regarding non-forcedirectio mounts.

I’ve gotten a large number of requests for that code. I don’t have Solaris gear around here so I asked Padraig O’Sullivan to do the favor of testing the modified GNU cp(1) versus stock cp(1). Padraig states:

Here is his recipe for making a cp8M for solaris:

I used coreutils version 5.2.1 which can be obtained from here:

http://ftp.gnu.org/pub/gnu/coreutils/

In the coreutils-5.2.1/src directory I modified the copy.c file at line 287 with the following modification:

# diff -b coreutils-5.2.1/src/copy.c cp8m/copy.c

287c315

<   buf_size = ST_BLKSIZE (sb);

>   /* buf_size = ST_BLKSIZE (sb);*/

288a317,319

>      buf_size = 8388608 ;

I then used the Makefile which is supplied with coreutils to build the cp binary.

Motives
The intent of this whole blog thread is really nothing more than diving into filesystem, and to some degree VM, internals topics so I truly hope that others will do this test. It would be interesting to see whether the mmap-enabled stock cp(1) is better than cp8M with, say, normal UFS mounts. The difference is clear to me on forcedirectio and that is, naturally, the type of mount I would take the most interest in.

 

Copying Files on Solaris. Slow or Fast, It’s Your Choice. Part I

In my recent blog post entitled Standard File Utilities with Direct I/O, I covered the concept of using direct I/O filesystems for storing files to eliminate the overhead of caching them. Consider such files as archived redo. It makes no sense polluting memory with spooled archived redo logs. Likewise, if you compress archived redo it makes little sense to have those disk blocks hanging out in main memory. However, the thread discusses the fact that most operating system file tools do their work by issuing ridiculously small I/O requests. If you haven’t yet, I encourage you to read that blog entry.

The blog entry seeded a lively thread of comments—some touched on theory, others were evidence of entirely missing the point. One excellent comment came in that refreshed some long-lost memories of Solaris. The reader wrote:

For what its worth, cp on Solaris 10 uses mmap64 in 8MB chunks to read in data and 8MB write()s to write it out.

I did in fact know that but it had been a while since I have played around on Solaris.

The Smell Test
I wasn’t fortunate enough to have genius passed to me through genetics. Oh how it seems life would be simpler if that were the case. But it isn’t. So, my motto is, “99% Perspiration, 1% Inspiration.” To that end, I learned early on in my career to develop the skills needed to spot something that cannot be correct—before bothering myself with whether or not it is in fact correct. There is a subtle difference. As soon as the Solaris mmap() enabled cp(1) thing cropped up, I spent next to no time at all pondering how that must certainly be better than normal reads (e.g., pread()) since it failed my smell test.

Ponder Before We Measure
What did I smell? There is just no way that walking through the input file by mapping, unmapping and remapping 8MB at a time could be faster than simply reusing a heap buffer. No way at all. After all, mmap() has to make VM adjustments that are not exactly cheap so taxing every trip to disk with a vigorous jolt of VM overhead makes little sense.

There must have been some point in time when a cp(1) implemented with mmap() was faster, but I suspect that was long ago. For instance, perhaps back in the day before pread(1)/pwrite(). Before these calls, positioning and reading a file required 2 kernel dives (one to seek and the other to read). Uh, but hold it. We are talking about cp(1) here—not random reads—where each successful read on the input file automatically adjusts the file pointer. That is, the input work loop would never have been encumbered with a pair of seek and read. Hmmm. Anyway, we can guess all day long why the Solaris folks chose to have cp(1) use mmap(2) as its input work horse, but in the end we’ll likely never know.

A Closer Look
In the following truss output, the Solaris cp(1) is copying a 5.8MB file to an output file called “xxx.” After getting a file descriptor for the input file, the output file is created. Next, mmap() is used on the input file (reading all 5.8MB since it is smaller than the 8MB operation limit). Next, the write call is used to write all 6161922 bytes from the mmap()ed region out to the output file (fd 4).

open64("SYSLOG-4", O_RDONLY) = 3
creat64("xxx", 0777) = 4
stat64("xxx", 0x00028640) = 0
fstat64(3, 0x000286D8) = 0
mmap64(0x00000000, 6161922, PROT_READ, MAP_SHARED, 3, 0) = 0xFEC00000
write(4, " F e b 5 1 0 : 3 6".., 6161922) = 6161922
munmap(0xFEC00000, 6161922) = 0

Of course if the file happened to be larger than 8MB, cp(1) would unmap and then remap the next chunk and on it would proceed in a loop until the input EOF is reached. That is a lot more “moving parts” than simply calling read(2) over and over clobbering the contents of a buffer allocated at the onset of the copy operation—without continual agitation of the VM subsystem with mmap().

I couldn’t imagine how cp(1) using mmap() would be any faster than read(2)/write(2). But then, it actually only replaces the input read with an mmap() while using write(2) on the output side. I couldn’t imagine how replacing just the input portion with mmap() would be faster than a cp() that uses a static heap buffer with read/write pairs. Moreover, I couldn’t picture how the mmap() approach would be easier on resources.

Measure Before Blogging
Not exactly me since I don’t have much Solaris gear around here. I asked Padraig O’Sullivan to compare the stock cp(1) of Solaris 10 to a GNU cp(1) with the modification I discuss in this blog entry. The goal at hand was to test whether the stock cp(1) constantly mapping and unmapping the input file is somehow faster or more gentle on processor cycles than starting out with a heap buffer and reusing it. The latter is exactly what GNU cp(1) does of course. Padraig asked:

One question about the benchmark you want me to run (I want to make sure I get all the data you want) – is this strategy ok?

1. Mount a UFS filesystem with the forcedirectio option
2. Create a 1 GB file on this filesystem
3. Copy the file using the standard cp(1) utility and record timing statistics
4. Copy the file using your modified cp8M utility and record timing statistics

Let me know if this is ok or if you want more information for the benchmark.

 

There was something else. I wanted a fresh system reboot just prior to each copy operation to make sure there were no variables. Padraig had the following to report:

[…] manage to run the benchmark in the manner you requested this morning […] Below is the results. I rebooted the machine before performing the copy each time.

# ls -l large_file
-rw-r–r– 1 root root 1048576000 Mar 5 10:25 large_file

# time /usr/bin/cp large_file large_file.1

real 2m17.894s
user 0m0.001s
sys 0m10.853s

# time /usr/bin/cp8m large_file large_file.2

real 1m57.932s
user 0m0.002s
sys 0m8.057s

Look, I’m just an old Sequent hack and yet the results didn’t surprise me. The throughput increased roughly 16% from 7.3MB/s to 8.5MB/s. What about resources? The tweaked GNU cp8M utilized roughly 26% less kernel mode processor cycles to do the same task. That’s not trivial since we didn’t actually eliminate any I/O. What’s that? Yes, cp8M reduces the wall clock time it takes to copy a 1000MB file by 16%–without eliminating any physical I/O!

Controversy
Yes, blogging is supposed to be a bit controversial. It cultivates critical thought and the occasional gadfly, fly-by commentary I so dearly appreciate. Here we have a simple test that shows a “normal” cp(1) is slightly faster and significantly lighter on resources than the magical mmap()-enabled cp(1) that ships with Solaris. Does that mean there isn’t a case where the mmap() style is better? I don’t know. It could be that some really large Solaris box with dozens of concurrent cp(1) operations would show the value of the mmap() approach. If so, let us know.

What Next?
I hope someone will volunteer to test cp8M on a high-end Solaris system. Maybe the results will help us understand why cp(1) uses mmap() on Solaris for its input file. Maybe not. Guess which way I’d wager.


DISCLAIMER

I work for Amazon Web Services. The opinions I share in this blog are my own. I'm *not* communicating as a spokesperson for Amazon. In other words, I work at Amazon, but this is my own opinion.

Enter your email address to follow this blog and receive notifications of new posts by email.

Join 2,949 other followers

Oracle ACE Program Status

Click It

website metrics

Fond Memories

Copyright

All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.

%d bloggers like this: