This is the third installment on this thread. For context, please see:
Copying Files on Solaris. Slow or Fast, It’s Your Choice. Part I
Copying Files on Solaris. Slow or Fast, It’s Your Choice. Part II
What About cp8M Versus Stock cp(1) with Non-forcedirectio?
That is a good question. The saga continues after my post about copying files on Solaris. Once again, Padraig O’Sullivan was kind enough to test cp8M (available here) versus stock cp(1) using a normally mounted UFS filesystem (non-forcedirectio). He reports:
Ok, I ran the benchmark in the same manner as before WITHOUT forcedirectio, i.e., I rebooted the machine before each copy of the file.
# time /usr/bin/cp large_file large_file.1
real 2m17.504s
user 0m0.002s
sys 0m14.363s
#
# time /usr/bin/cp8m large_file large_file.
real 1m56.217s
user 0m0.003s
sys 0m14.264s
#
Why?
I don’t know. I certainly did not expect an increase in kernel mode cycles for the mmap-enabled cp(1). Please refer to Part I in this series to see that the comparison here is between 14.363s and 10.853s of kernel-mode cycles. We’re not talking about a little increase. No, what Padraig’s measurements show is a 32% increase in kernel mode cycles when copying a 1000MB file using stock cp(1) on a regular mount compared to the same work on a file in a forcedirectio mount. But hey, at least the performance (in MB/s) was consistently 16% lower than that of cp8M. Yes, that was sarcasm.
I haven’t yet gotten my head around why the standard mmap-enabled cp(1) suffers such a jump in kernel mode processor overhead when switching from direct I/O to a normal UFS mount. I need to think about that a bit.
As usual, a picture speaks a thousand words, so I’ll provide two:
Remember my rant about the “small test”?
Sharing, and Caring
There was a comment by a reader on Part I of this blog thread that is worthy of discussion. The reader commented:
Perhaps a fairly obvious statement this, but notice the use of MAP_SHARED on the mmap call? – (I suspect you’ve spotted that already). This means that multiple processes can attach to the same memory mapped file simultaneously.
That is a good blog comment and evidence of someone giving it some thought. But I’d like to comment on the sharable aspect of the 8MB map the reader brought out. I replied:
Your point about the kernel bcopy from UFS read buffers to the heap buffer in the address space of the cp(1) process is a good one, but this is a forcedirectio case.
That means there is no copy from the page cache into the virtual address space of the process, since it is direct I/O. My reply continued:
Since this is an Oracle blog, I would naturally go with the forcedirectio comparison first. It will be interesting to see with a normal UFS mount.
I’ve got a $2 bet that the MAP_SHARED is only there to facilitate copying an already mmapped file…the odds of a process jumping in and sharing an 8MB map that only lives for the duration of an I/O in and an I/O out seem pretty slim to me…but then that is 8MB twice…hmmm…I guess that 8MB mmap could exist for as much as 2-3 seconds if the I/O is headed for a single, simple drive. Sounds like a race just to share an 8MB map to me.
A Closer Look
Yes, when the stock cp(1) mmaps each 8MB segment of the input file, it does so with MAP_SHARED. Like I said, I suspect the only thought behind that flag usage was to ensure there wouldn’t be “twinkling” mmap failures by other processes that could potentially be mmapping parts of that file while cp(1) is walking through it.
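To make the mechanism concrete, here is a minimal sketch of an mmap-style copy loop in the spirit described above: map the input 8MB at a time with MAP_SHARED, write each mapping out, then unmap. This is an illustration only, with simplified error handling and invented function names; it is not the actual Solaris cp.c source.

```c
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

#define MAPSIZE (8 * 1024 * 1024)   /* the 8MB window discussed above */

int
mmap_copy(const char *src, const char *dst)
{
    int in = open(src, O_RDONLY);
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    struct stat st;
    off_t off;

    if (in < 0 || out < 0 || fstat(in, &st) < 0)
        return (-1);

    for (off = 0; off < st.st_size; off += MAPSIZE) {
        size_t len = (st.st_size - off) > MAPSIZE ?
            MAPSIZE : (size_t)(st.st_size - off);

        /* MAP_SHARED: any other process mapping this range of the file
         * shares these pages for the short life of the mapping. */
        char *buf = mmap(NULL, len, PROT_READ, MAP_SHARED, in, off);

        if (buf == MAP_FAILED)
            return (-1);
        if (write(out, buf, len) != (ssize_t)len)
            return (-1);
        (void) munmap(buf, len);
    }
    (void) close(in);
    (void) close(out);
    return (0);
}
```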
The reader’s comment continued:
That’s not to say that they all need to be “cp”s – anything using mmap() on the same file at about the same time will yield a benefit – the 8MB chunk paged in by mmap should only be later reclaimed by the pagescanner (or when the last process detaches?).
I already discussed the odds of another process getting in there and benefiting from that very transient mmap. It is 8MB in size and valid only during the read in and write out, about 2-3 seconds on a really slow disk subsystem.
Reclaims
What’s this about reclaims? Good topic. When the mmap is dissolved through munmap(), the pages of the file are put on the free list (pagecache). Here is where the non-forcedirectio cp8M and cp(1) have a lot in common. In both cases, the blocks from the input file remain in main memory. Now that is where there is some true opportunity for sharing, but only in the non-forcedirectio case. All said, it doesn’t take mmap() to get sharing of file contents being copied when you are using UFS with a normal mount.
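For contrast, here is a minimal sketch of the read(2)/write(2) loop that cp8M effectively uses: one 8MB heap buffer and no mmap at all. On a normal (non-forcedirectio) UFS mount the input file’s blocks land in the page cache either way, which is the sharing point made above. Again, simplified error handling and an invented function name; this is not the actual coreutils source.

```c
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>

#define BUFSIZE (8 * 1024 * 1024)   /* the "8M" in cp8M */

int
rw_copy(const char *src, const char *dst)
{
    int in = open(src, O_RDONLY);
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    char *buf = malloc(BUFSIZE);
    ssize_t n;

    if (in < 0 || out < 0 || buf == NULL)
        return (-1);

    /* Plain buffered copy: each read fills the page cache on a normal
     * UFS mount, then the data is copied into the 8MB heap buffer. */
    while ((n = read(in, buf, BUFSIZE)) > 0) {
        if (write(out, buf, n) != n)
            return (-1);
    }
    free(buf);
    (void) close(in);
    (void) close(out);
    return (n < 0 ? -1 : 0);
}
```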
So the question remains, what’s up with the mmap()-enabled cp(1)?
Is this thread making you sleepy?
Just posted a reply to your last entry, and then noticed you’ve beaten me to it with a new entry 🙂
> That means there is no copy from the page cache into the virtual address space of process since it is direct I/O.
Can you clarify what you mean there? – You are correct that there is no copy from the page cache (because the page cache is not used for directio), however, the data still needs to be copied from the kernel read buffer to the userland process (specifically, from the directio_buf_cache to the heap of the process). The kernel cannot write directly to the userland buffer.
Forgot to say: on the Solaris “cp”, with file sizes less than 32K it’ll do a normal read(). Doesn’t bring us any closer to the reasoning behind the mmap; however, it does suggest that there is a specific case that they were tuning for.
Mike,
With direct I/O the DMA is done directly from disk into the user address space buffer. I don’t know how to say it any simpler than that–and I’m not trying to sound harsh. There are kernel memory **structures** associated with the direct I/O for sure, but those are just status/state structures, not to be confused with buffers.
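As a small illustration of that point, direct I/O can also be requested per file on Solaris with directio(3C), rather than via the forcedirectio mount option. Once enabled, a qualifying read(2) (properly aligned, and so on) bypasses the page cache and the data lands in the caller’s buffer; only bookkeeping structures live in the kernel. This is a sketch with an invented helper name, not code from cp(1) or cp8M.

```c
#include <sys/types.h>
#include <sys/fcntl.h>
#include <fcntl.h>
#include <unistd.h>

int
open_direct(const char *path)
{
    int fd = open(path, O_RDONLY);

    if (fd >= 0)
        (void) directio(fd, DIRECTIO_ON);   /* advise direct I/O on this file */
    return (fd);
}
```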
Kevin,
I’ve run a couple of tests using cp8m on Solaris 8 on a UFS filesystem without forcedirectio and see great results. When copying an 8GB file, I’m getting about 18 MB/s using the stock cp and about 28 MB/s using cp8m (built from coreutils-6.9). Copying a terabyte just got a little easier :).
$ time cp /u02/oradata/stage01/data_l01.dbf /dbbk01/u02/oradata/stage01
real 7m17.41s
user 0m0.18s
sys 2m32.55s
$ time ~/bin/cp8m /u02/oradata/stage01/data_l01.dbf /dbbk01/u02/oradata/stage01
real 4m49.59s
user 0m0.02s
sys 3m4.50s
Many thanks for your blog in general and specifically for the entries about cp8m.
Where can I get cp8m?