In my recent blog post entitled Standard File Utilities with Direct I/O, I covered the concept of using direct I/O filesystems for storing files to eliminate the overhead of caching them. Consider files such as archived redo. It makes no sense to pollute memory with spooled archived redo logs. Likewise, if you compress archived redo it makes little sense to have those disk blocks hanging out in main memory. The comment thread on that post, however, discussed the fact that most operating system file tools do their work by issuing ridiculously small I/O requests. If you haven’t yet, I encourage you to read that blog entry.
The blog entry seeded a lively thread of comments—some touched on theory, others were evidence of entirely missing the point. One excellent comment came in that refreshed some long-lost memories of Solaris. The reader wrote:
For what it’s worth, cp on Solaris 10 uses mmap64 in 8MB chunks to read in data and 8MB write()s to write it out.
I did in fact know that, but it had been a while since I had played around on Solaris.
The Smell Test
I wasn’t fortunate enough to have genius passed to me through genetics. Oh, how it seems life would be simpler if that were the case. But it isn’t. So, my motto is, “99% Perspiration, 1% Inspiration.” To that end, I learned early on in my career to develop the skills needed to spot something that cannot be correct—before bothering myself with whether or not it is in fact correct. There is a subtle difference. As soon as the Solaris mmap()-enabled cp(1) thing cropped up, I spent next to no time at all pondering whether it really could be better than normal reads (e.g., pread()), because it failed my smell test.
Ponder Before We Measure
What did I smell? There is just no way that walking through the input file by mapping, unmapping and remapping 8MB at a time could be faster than simply reusing a heap buffer. No way at all. After all, mmap() has to make VM adjustments that are not exactly cheap so taxing every trip to disk with a vigorous jolt of VM overhead makes little sense.
There must have been some point in time when a cp(1) implemented with mmap() was faster, but I suspect that was long ago. For instance, perhaps back in the day before pread(2)/pwrite(2). Before these calls, positioning and reading a file required two kernel dives (one to seek and the other to read). Uh, but hold it. We are talking about cp(1) here—not random reads—where each successful read on the input file automatically advances the file pointer. That is, the input work loop would never have been encumbered with a pair of seek and read calls. Hmmm. Anyway, we can guess all day long why the Solaris folks chose to have cp(1) use mmap(2) as its input workhorse, but in the end we’ll likely never know.
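For readers who have not bumped into pread(2), here is a minimal sketch of that point; the file name and offset are hypothetical, purely for illustration. The old lseek(2)+read(2) pair costs two kernel dives per positioned read, pread(2) does it in one, and a sequential copy loop needs neither because plain read(2) advances the file pointer on its own.

/* posread.c: a sketch contrasting lseek()+read() with pread().
 * The file name and offset are made up for illustration. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    char buf[8192];
    int fd = open("example.dat", O_RDONLY);
    if (fd < 0) { perror("open"); exit(1); }

    /* Old style: two kernel dives for one positioned read. */
    if (lseek(fd, (off_t)65536, SEEK_SET) == (off_t)-1) { perror("lseek"); exit(1); }
    ssize_t n = read(fd, buf, sizeof(buf));

    /* pread(2): one call, and the file pointer is left untouched. */
    ssize_t m = pread(fd, buf, sizeof(buf), (off_t)65536);

    printf("read: %zd bytes, pread: %zd bytes\n", n, m);
    close(fd);
    return 0;
}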
A Closer Look
In the following truss output, the Solaris cp(1) is copying a 5.8MB file to an output file called “xxx.” After getting a file descriptor for the input file, the output file is created. Next, mmap() is used on the input file (mapping all 5.8MB, since the file is smaller than the 8MB chunk limit). Next, write(2) is used to write all 6161922 bytes from the mmap()ed region out to the output file (fd 4).
open64("SYSLOG-4", O_RDONLY) = 3
creat64("xxx", 0777) = 4
stat64("xxx", 0x00028640) = 0
fstat64(3, 0x000286D8) = 0
mmap64(0x00000000, 6161922, PROT_READ, MAP_SHARED, 3, 0) = 0xFEC00000
write(4, " F e b 5 1 0 : 3 6".., 6161922) = 6161922
munmap(0xFEC00000, 6161922) = 0
Of course, if the file happened to be larger than 8MB, cp(1) would unmap and then remap the next chunk and proceed in a loop until the input EOF is reached. That is a lot more “moving parts” than simply calling read(2) over and over, clobbering the contents of a buffer allocated at the outset of the copy operation—without continually agitating the VM subsystem with mmap().
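To make those moving parts concrete, here is a minimal sketch of an mmap()-driven copy loop along the lines the truss output above suggests. This is not the Solaris cp(1) source, just an illustration of mapping, writing and unmapping up to 8MB per trip, with only minimal error handling:

/* mmapcp.c: map up to 8MB of the input, write it out, unmap, repeat.
 * A sketch of the pattern suggested by the truss output above,
 * not the actual Solaris cp(1) implementation. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define CHUNK (8 * 1024 * 1024)   /* 8MB mapping window */

int main(int argc, char **argv)
{
    if (argc != 3) { fprintf(stderr, "usage: %s src dst\n", argv[0]); exit(1); }

    int in = open(argv[1], O_RDONLY);
    int out = creat(argv[2], 0666);
    struct stat st;
    if (in < 0 || out < 0 || fstat(in, &st) < 0) { perror("setup"); exit(1); }

    for (off_t off = 0; off < st.st_size; off += CHUNK) {
        size_t len = (st.st_size - off) < CHUNK ? (size_t)(st.st_size - off) : CHUNK;

        /* Map the next window of the input file ... */
        char *p = mmap(NULL, len, PROT_READ, MAP_SHARED, in, off);
        if (p == MAP_FAILED) { perror("mmap"); exit(1); }

        /* ... push it to the output file ... */
        if (write(out, p, len) != (ssize_t)len) { perror("write"); exit(1); }

        /* ... and tear the mapping down before the next trip to disk. */
        if (munmap(p, len) < 0) { perror("munmap"); exit(1); }
    }
    close(in);
    close(out);
    return 0;
}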
I couldn’t imagine how cp(1) using mmap() would be any faster than read(2)/write(2). After all, it only replaces the input read with an mmap() while still using write(2) on the output side. I couldn’t see how replacing just the input portion with mmap() would be faster than a cp(1) that uses a static heap buffer with read/write pairs. Moreover, I couldn’t picture how the mmap() approach would be easier on resources.
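By contrast, a copy loop that allocates a single heap buffer up front and simply reuses it looks roughly like the following. Again, this is a sketch rather than the GNU cp(1) source; the 8MB buffer size is meant to mirror what the cp8M name suggests:

/* rwcp.c: a read/write copy loop reusing one 8MB heap buffer,
 * in the spirit of the cp8M modification; a sketch, not GNU cp(1). */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BUFSZ (8 * 1024 * 1024)   /* 8MB buffer, allocated once */

int main(int argc, char **argv)
{
    if (argc != 3) { fprintf(stderr, "usage: %s src dst\n", argv[0]); exit(1); }

    int in = open(argv[1], O_RDONLY);
    int out = creat(argv[2], 0666);
    char *buf = malloc(BUFSZ);
    if (in < 0 || out < 0 || buf == NULL) { perror("setup"); exit(1); }

    /* Each read() clobbers the same buffer and advances the file
     * pointer; no mapping or unmapping is involved. */
    ssize_t n;
    while ((n = read(in, buf, BUFSZ)) > 0) {
        if (write(out, buf, n) != n) { perror("write"); exit(1); }
    }
    if (n < 0) perror("read");

    free(buf);
    close(in);
    close(out);
    return 0;
}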
Measure Before Blogging
Well, not exactly me, since I don’t have much Solaris gear around here. I asked Padraig O’Sullivan to compare the stock cp(1) of Solaris 10 to a GNU cp(1) with the modification I discuss in this blog entry. The goal at hand was to test whether the stock cp(1), constantly mapping and unmapping the input file, is somehow faster or more gentle on processor cycles than starting out with a heap buffer and reusing it. The latter is exactly what GNU cp(1) does, of course. Padraig asked:
One question about the benchmark you want me to run (I want to make sure I get all the data you want) – is this strategy ok?
1. Mount a UFS filesystem with the forcedirectio option
2. Create a 1 GB file on this filesystem
3. Copy the file using the standard cp(1) utility and record timing statistics
4. Copy the file using your modified cp8M utility and record timing statistics
Let me know if this is ok or if you want more information for the benchmark.
There was something else. I wanted a fresh system reboot just prior to each copy operation to make sure there were no variables. Padraig had the following to report:
[…] managed to run the benchmark in the manner you requested this morning […] Below are the results. I rebooted the machine before performing the copy each time.
# ls -l large_file
-rw-r--r-- 1 root root 1048576000 Mar 5 10:25 large_file
# time /usr/bin/cp large_file large_file.1
real 2m17.894s
user 0m0.001s
sys 0m10.853s
# time /usr/bin/cp8m large_file large_file.2
real 1m57.932s
user 0m0.002s
sys 0m8.057s
Look, I’m just an old Sequent hack and yet the results didn’t surprise me. The throughput increased roughly 16%, from 7.3MB/s to 8.5MB/s. What about resources? The tweaked GNU cp8M used roughly 26% fewer kernel-mode processor cycles to do the same task. That’s not trivial, since we didn’t actually eliminate any I/O. What’s that? Yes, cp8M reduces the wall clock time it takes to copy a 1000MB file by 16%—without eliminating any physical I/O!
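To spell out the arithmetic behind those figures: 1000MB in 137.9 seconds works out to roughly 7.3MB/s, while 1000MB in 117.9 seconds is roughly 8.5MB/s, about a 16% gain in throughput. On the resource side, 8.06 seconds of system time versus 10.85 seconds is roughly 26% less kernel-mode processor time for the same copy.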
Controversy
Yes, blogging is supposed to be a bit controversial. It cultivates critical thought and the occasional gadfly, fly-by commentary I so dearly appreciate. Here we have a simple test that shows a “normal” cp(1) is slightly faster and significantly lighter on resources than the magical mmap()-enabled cp(1) that ships with Solaris. Does that mean there isn’t a case where the mmap() style is better? I don’t know. It could be that some really large Solaris box with dozens of concurrent cp(1) operations would show the value of the mmap() approach. If so, let us know.
What Next?
I hope someone will volunteer to test cp8M on a high-end Solaris system. Maybe the results will help us understand why cp(1) uses mmap() on Solaris for its input file. Maybe not. Guess which way I’d wager.
Comments

Perhaps a fairly obvious statement, but notice the use of MAP_SHARED on the mmap() call? (I suspect you’ve spotted that already.) This means that multiple processes can attach to the same memory-mapped file simultaneously.
That’s not to say that they all need to be cp(1)s – anything using mmap() on the same file at about the same time will benefit. The 8MB chunk paged in by mmap should only be reclaimed later by the page scanner (or when the last process detaches?).
Also – I think the mmap() method has the potential for using less memory (each process doesn’t necessarily need to have a large heap buffer) – probably not a concern these days, but then, I suspect that “cp” was written an awful long time ago.
Part of the motivation behind implementing memory-mapped I/O was to avoid the double-buffering associated with read() calls (read from hardware into kernel space, then copy the data from the kernel buffer to userspace). It does, of course, come with an associated cost (increased memory management overhead), but then, that’s the trade-off, isn’t it?
There may be some cases where mmap() is more efficient, but I think it would be difficult to suggest that there aren’t any cases where read() is best.
Mike,
Your point about the kernel bcopy from UFS read buffers to the heap buffer in the address space of the cp(1) process is a good one, but this is a forcedirectio case. I’ve got some readers comparing on normal mounts and the results could indeed be a lot different in that case.
Since this is an Oracle blog, I would naturally go with the forcedirectio comparison first. It will be interesting to see with a normal UFS mount.
I’ve got a $2 bet that the MAP_SHARED is only there to facilitate copying an already mmapped file…the odds of a process jumping in and sharing an 8MB map that only lives for the duration of an I/O in and an I/O out seem pretty slim to me…but then that is 8MB twice…hmmm…I guess that 8MB mmap could exist for as much as 2-3 seconds if the I/O is headed for a single, simple drive. Sounds like a race just to share an 8MB map to me.
> but this is a forcedirectio case
Apologies if I’m misinterpreting your response here, but I don’t think this matters. I wasn’t talking about the page cache (which would be effectively disabled on a forcedirectio filesystem); I meant the kernel read buffer – the kernel won’t/can’t write directly to the userspace buffer, so when it receives the data in from the VFS module (in this case, UFS), it needs to store it somewhere (directio_buf_cache kmem).
As I understand it, when working *without* directio, it’ll also pass through the page cache, which will improve some kinds of I/O, but that’s an aside that I wasn’t really considering here.
Using read() and directio implies a very long, single-threaded operation (user process calls read(), kernel requests data from vfs, waits for result, copyout()’s the results back to the userland process).
Memory mapped I/O (even on a directio filesystem) suggests a more “multithreaded” approach, because the memory management system will be paging in the data (mmmm – would it prefetch on a directio filesystem?), while the user thread is getting on with its work (although this is a pretty simplistic example, because “cp” is immediately write()’ing the full buffer size out again).
*disclaimer – I fully reserve the right to be mistaken on any of the above – this is all based on my understanding, and happy to be otherwise educated 😉
ls -l allcp-test.dat
-rw-r--r-- 1 root root 5287847424 Jul 4 13:08 allcp-test.dat
#time /opt/sfw/bin/cp allcp-test.dat /export/home/sunteam/
real 3m20.497s
user 0m1.099s
sys 0m56.181s
(this cp is GNU cp, using the read/write method)
#time /usr/bin/cp allcp-test.dat /export/home/sunteam/
real 2m11.940s
user 0m0.004s
sys 0m26.583s
(this cp is the stock Solaris cp, using the mmap/munmap method)
Try in /usr/bin/ksh93 on Opensolaris:
builtin cp
and benchmark that. This is the AST/ksh93 cp command and should run much faster.
Olga
The mmap method might be better on a machine with 16MB of physical RAM. Once you call write(), it lets the kernel decide how much RAM (up to 8MB) to actually give the job.
Sir! I wonder if there was a reboot IN BETWEEN the taking of the two timings… Because it is possible that the first copy — the loser — brought all or most of the source file into cache, thus allowing the second copy (the winner) to get the data faster…
My own comparison of the stock SunOS-5.8 cp with gcp here yields:
% gcp --version
cp (GNU fileutils) 4.0
% ls -l /src/autotree.tar.gz
-rw-r--r-- 1 root staff 18110684 Sep 11 2003 /src/autotree.tar.gz
% time gcp /src/autotree.tar.gz 1
0.00u 0.27s 0:00.31 87.0%
% time cp /src/autotree.tar.gz 2
0.00u 0.13s 0:00.13 100.0%
% time gcp /src/autotree.tar.gz 3
0.00u 0.19s 0:00.19 100.0%
% time cp /src/autotree.tar.gz 4
0.00u 0.14s 0:00.14 100.0%
Which shows that a) order matters greatly; b) the mmap-ing cp wins notably over gcp even when the source is already in cache…
BTW, truss reveals that — at least here — mmap is used on much smaller chunks. A mere 262144 bytes are mmap-ed and madvise()d (MADV_SEQUENTIAL) at a time — not 8MB. Perhaps cp determines that based on some filesystem parameter…
The gcp — according to truss — reads/writes 8192 bytes at a time.
Hello Miwa,
No reboots.
Well, without reboots — and without repetitions — the tests aren’t valid… I did not reboot either, but I repeated the tests several times — and Solaris’ cp wins over gcp hands down. At least, when the source file is already in cache.
If you think I’m wrong, and Glenn Fawcett is wrong and Padraig O’Sullivan is wrong then fine. But please tell me what gcp has to do with anything? Did you modify it as I specified?
There seem to be too many differences between your tests and Kevin’s to make any real valid comparisons. You are using gcp and Kevin is using the cp8M that is referenced in Padraig’s blog. The HW is not listed in either case, and the chip cache architecture can make a difference as well. Finally, the OS versions are likely different since you are commenting on a 4-year-old blog entry… You must have more time on your hands than I do 🙂
I am sure that you are seeing behavior valid in your environment, but this in no way invalidates what Kevin was seeing back in 2007 in his environment.
The utility in use (gcp vs. cp8m — whatever it is) is of no account. What matters is that mmap/write wins over read/write.
The differences in hardware matter little, as long as the comparison itself is performed on the same box. The OS (SunOS-5.8) is over a decade old, which dwarfs the age of this thread.
Bottom line is, Kevin’s very methodology in 2007 was wrong. If only two tests are taken — as was the case — the second test will “win” just because the data is already in the OS’ cache…
I make no accusations, but wish to correct the record for posterity. I found this blog, while researching the “read vs. mmap” question — and tried to reproduce the results described here…
Miwa,
I wasn’t 9 years old when I posted the thread in question so it most certainly isn’t as elementary as a cache versus non-cache issue. I’ve let your comments through though. If you ever get around to actually testing what the blog post is about (cp8M), let us know. Otherwise, you’ve convinced us that you believe you are more skilled on the topic so I think the thread is about dead.
Please see the following regarding fresh-booted results:
I’ll be happy to re-test with cp8m — just give me the (link to) executable.
But, again, if there were no reboots between the two copyings described in the original post, then the second one won (at least, partially) due to the source data being in cache…
Miwa,
I appreciate your passion. If, however, you care to put that passion to use for such an old topic then I recommend you repeat the heavy lifting done by Padraig O’Sullivan and Glenn Fawcett. Start here:
http://posulliv.github.com/2008/11/29/building-a-modified-cp-binary-on-solaris-10.html
When copy.c is thus modified, the cp utility crashes, most likely because the memory is then allocated on the stack (using alloca()) rather than on the heap (using malloc()). So I changed the code to use malloc() — though alloca() is faster, the difference is insignificant in this case, as the allocation only happens once anyway.
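For illustration only, here is a minimal sketch of the kind of allocation change being described; it is not the actual coreutils copy.c code. An 8MB alloca() carves the buffer out of the stack and can blow a typical default stack limit, while a one-time malloc() puts it on the heap:

/* alloca_vs_malloc.c: an illustration of the allocation point above,
 * not the actual coreutils copy.c change. */
#include <alloca.h>
#include <stdlib.h>

#define BUF_SIZE (8 * 1024 * 1024)   /* 8MB copy buffer */

void copy_with_stack_buffer(void)
{
    /* Risky: 8MB carved out of the stack may exceed the default
     * stack limit and crash the process. */
    char *buf = alloca(BUF_SIZE);
    (void)buf;    /* a copy loop would use buf here */
}

void copy_with_heap_buffer(void)
{
    /* Safer: the buffer lives on the heap and is allocated only
     * once for the whole copy. */
    char *buf = malloc(BUF_SIZE);
    if (buf == NULL)
        return;
    /* a copy loop would use buf here */
    free(buf);
}

int main(void)
{
    copy_with_heap_buffer();
    /* copy_with_stack_buffer() is intentionally not called: on a
     * default 8MB stack limit it is likely to fault. */
    return 0;
}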
The results. On Solaris 8, using a 40MB file:
% time src/cp big.bcp ~/1
0.00u 2.35s 0:11.38 20.6%
% time cp big.bcp ~/1
0.01u 3.65s 0:03.80 96.3%
% time src/cp big.bcp ~/1
0.01u 1.80s 0:04.09 44.2%
% time cp big.bcp ~/1
0.00u 2.89s 0:02.90 99.6%
% time src/cp big.bcp ~/1
0.01u 1.78s 0:03.97 45.0%
% time cp big.bcp ~/1
0.01u 2.95s 0:02.97 99.6%
Although the total time goes down after the first two runs, the stock cp wins repeatedly over the hacked-up src/cp.
On Solaris 10, using a 5GB file, the picture is slightly different:
% time /tmp/coreutils-5.2.1/src/cp big /tmp/z
0.02u 78.33s 2:16.51 57.3%
% time /tmp/coreutils-5.2.1/src/cp big /tmp/z
% time cp big /tmp/z
0.02u 94.78s 2:23.34 66.1%
% time cp big /tmp/z
0.01u 89.34s 2:29.74 59.6%
% time /tmp/coreutils-5.2.1/src/cp big /tmp/z
0.02u 90.50s 2:23.89 62.9%
% time cp big /tmp/z
0.01u 80.09s 2:04.56 64.3%
% time /tmp/coreutils-5.2.1/src/cp big /tmp/z
Here the two methods seem evenly matched — any differences between them are well within the differences between the runs of the same method.
I am unable to reproduce your 2007 results and remain unconvinced about the methodology used back then… However, both of my machines have only a single local disk (though the actual hardware is a two-disk mirror on both), so it is possible that placing the source and the target of the copy on distinct physical devices could’ve produced different results.
Could you elaborate on why you think the caching did not affect your tests back then? A 1GB file could’ve been cached entirely on a freshly rebooted machine with no users (and thus plenty of idle RAM)…
No, I’m sorry, the thread is too old for me to take the time to revisit.
Oops, a problem copy-pasting above. The first four of the Solaris 10 timings should be:
% time /tmp/coreutils-5.2.1/src/cp big /tmp/z
0.02u 78.33s 2:16.51 57.3%
% time /tmp/coreutils-5.2.1/src/cp big /tmp/z
0.02u 94.78s 2:23.34 66.1%
% time cp big /tmp/z
0.01u 89.34s 2:29.74 59.6%
% time /tmp/coreutils-5.2.1/src/cp big /tmp/z
0.02u 90.50s 2:23.89 62.9%
I think the reason why they used mmap as opposed to read/write is to take advantage of files that are already cached.
Essentially, if a file is cached, the mmap call will just map to that address space, and read from it.
In the case of read(), it will have to copy from the cache to heap space, then from heap space back to the page cache (since writes are buffered), and then eventually to disk.
Essentially the mmap version saves a memory to memory copy in the case where most of the file is cached.
The directio case is a “special” case where this optimization just does not apply, and perhaps gets slightly in the way.
Ideally, cp would use a streaming, async-I/O style copy when the file is opened with directio, whether via a filesystem mount option or a command line parameter.
“perhaps gets slightly in the way”
…Christo, there is nothing to mmap with directio…so your first supposition is correct (does not apply).
On Solaris, mmap pollutes the page cache when doing I/O even when the filesystem is mounted forcedirectio.
Yes, Andy, of course you are right. My last comment (to Christo) was off base because I totally forgot that the “directio” I was replying about in that context was the Solaris forcedirectio mount option. I had Linux directio in mind when I typed that comment. That’s what I get for not closing down comments on a post that is 4 years old…
So, yes, with the forcedirectio mount option an mmap call still has to use the page cache, because that’s what mmap does. I’m actually surprised that Solaris supports mmap on a forcedirectio mount, but that is another topic. Not that I think there is anything wrong with supporting mmap on a forcedirectio mounted FS, I’m just surprised it supports it. I recall from back in my Veritas days that this has not always been the case (mmap support on forcedirectio), but that was over a decade ago so my recollection is spotty on the matter.