In my recent blog post entitled Standard File Utilities with Direct I/O, I covered the concept of using direct I/O filesystems for storing files to eliminate the overhead of caching them. Consider files such as archived redo logs. It makes no sense to pollute memory with spooled archived redo logs. Likewise, if you compress archived redo, it makes little sense to have those disk blocks hanging out in main memory. However, that post also points out that most operating system file tools do their work by issuing ridiculously small I/O requests. If you haven’t yet, I encourage you to read that blog entry.
The blog entry seeded a lively thread of comments—some touched on theory, others were evidence of entirely missing the point. One excellent comment came in that refreshed some long-lost memories of Solaris. The reader wrote:
For what its worth, cp on Solaris 10 uses mmap64 in 8MB chunks to read in data and 8MB write()s to write it out.
I did in fact know that, but it had been a while since I had played around on Solaris.
The Smell Test
I wasn’t fortunate enough to have genius passed to me through genetics. Oh how it seems life would be simpler if that were the case. But it isn’t. So, my motto is, “99% Perspiration, 1% Inspiration.” To that end, I learned early on in my career to develop the skills needed to spot something that cannot be correct, before bothering myself with whether or not it is in fact correct. There is a subtle difference. As soon as the Solaris mmap()-enabled cp(1) thing cropped up, I spent next to no time at all pondering how that must certainly be better than normal reads (e.g., pread()), since it failed my smell test.
Ponder Before We Measure
What did I smell? There is just no way that walking through the input file by mapping, unmapping and remapping 8MB at a time could be faster than simply reusing a heap buffer. No way at all. After all, mmap() has to make VM adjustments that are not exactly cheap so taxing every trip to disk with a vigorous jolt of VM overhead makes little sense.
There must have been some point in time when a cp(1) implemented with mmap() was faster, but I suspect that was long ago. For instance, perhaps back in the day before pread(2)/pwrite(2). Before these calls, positioning and reading a file required 2 kernel dives (one to seek and the other to read). Uh, but hold it. We are talking about cp(1) here (not random reads), where each successful read on the input file automatically adjusts the file pointer. That is, the input work loop would never have been encumbered with a pair of seek and read. Hmmm. Anyway, we can guess all day long why the Solaris folks chose to have cp(1) use mmap(2) as its input workhorse, but in the end we’ll likely never know.
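To make the "2 kernel dives" point concrete, here is a minimal sketch of the three ways to get bytes from a file descriptor. The helper names are mine, made up for illustration, and Python's os module is only standing in for the underlying lseek(2), read(2) and pread(2) calls.

```python
import os
import tempfile


def read_at_two_calls(fd, offset, length):
    # Two kernel dives: one to reposition the file pointer, one to read.
    os.lseek(fd, offset, os.SEEK_SET)
    return os.read(fd, length)


def read_at_one_call(fd, offset, length):
    # One kernel dive: pread() takes the offset as an argument and
    # leaves the file pointer where it was.
    return os.pread(fd, length, offset)


def read_next(fd, length):
    # What a sequential copy loop does: no seek at all, because every
    # successful read() advances the file pointer by itself.
    return os.read(fd, length)


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"0123456789")
        tmp.flush()
        fd = os.open(tmp.name, os.O_RDONLY)
        try:
            print(read_at_two_calls(fd, 4, 3))
            print(read_at_one_call(fd, 4, 3))
            print(read_next(fd, 3))
        finally:
            os.close(fd)
```

Only the first helper pays for the extra call, and a sequential copy never needs it, which is why the pread() theory does not explain much for cp(1).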
A Closer Look
In the following truss output, the Solaris cp(1) is copying a 5.8MB file to an output file called “xxx.” After getting a file descriptor for the input file, the output file is created. Next, mmap() is used on the input file (reading all 5.8MB since it is smaller than the 8MB operation limit). Next, the write call is used to write all 6161922 bytes from the mmap()ed region out to the output file (fd 4).
open64("SYSLOG-4", O_RDONLY) = 3
creat64("xxx", 0777) = 4
stat64("xxx", 0x00028640) = 0
fstat64(3, 0x000286D8) = 0
mmap64(0x00000000, 6161922, PROT_READ, MAP_SHARED, 3, 0) = 0xFEC00000
write(4, " F e b 5 1 0 : 3 6".., 6161922) = 6161922
munmap(0xFEC00000, 6161922) = 0
Of course if the file happened to be larger than 8MB, cp(1) would unmap and then remap the next chunk and on it would proceed in a loop until the input EOF is reached. That is a lot more “moving parts” than simply calling read(2) over and over, clobbering the contents of a buffer allocated at the outset of the copy operation, without continual agitation of the VM subsystem with mmap().
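The map, write, unmap loop that the truss output implies can be sketched roughly as follows. This is not the Solaris source, just my reading of the system calls. The function name is hypothetical, and the window size has to be a multiple of mmap.ALLOCATIONGRANULARITY (8MB is).

```python
import mmap
import os

CHUNK = 8 * 1024 * 1024  # 8MB window, the size the reader reported


def copy_with_mmap(src_path, dst_path, chunk=CHUNK):
    """Copy src to dst by mapping, writing and unmapping one window at a time."""
    src_fd = os.open(src_path, os.O_RDONLY)
    dst_fd = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o666)
    try:
        size = os.fstat(src_fd).st_size
        offset = 0
        while offset < size:
            length = min(chunk, size - offset)
            # A fresh mapping for every window: this is the VM work that
            # gets paid on each trip through the loop.
            window = mmap.mmap(src_fd, length,
                               access=mmap.ACCESS_READ, offset=offset)
            try:
                with memoryview(window) as view:
                    written = 0
                    while written < length:
                        # write() may be partial, so keep going until
                        # the whole window has been written out.
                        with view[written:] as tail:
                            written += os.write(dst_fd, tail)
            finally:
                window.close()  # the munmap() in the truss output
            offset += length
    finally:
        os.close(src_fd)
        os.close(dst_fd)
    return size
```

For the 5.8MB file above the loop body runs exactly once, which matches the single mmap64/write/munmap triple in the trace.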
I couldn’t imagine how cp(1) using mmap() would be any faster than read(2)/write(2). But then, it actually only replaces the input read with an mmap() while using write(2) on the output side. I couldn’t imagine how replacing just the input portion with mmap() would be faster than a cp(1) that uses a static heap buffer with read/write pairs. Moreover, I couldn’t picture how the mmap() approach would be easier on resources.
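For comparison, the read/write approach with a single reused buffer looks something like the sketch below. It is an illustration of the technique, not the actual cp8M code. The function name and the 8MB default are my assumptions.

```python
BUF_SIZE = 8 * 1024 * 1024  # assumed 8MB transfer size


def copy_with_reused_buffer(src_path, dst_path, buf_size=BUF_SIZE):
    """Copy src to dst with read/write pairs through one heap buffer."""
    buf = bytearray(buf_size)  # allocated once, before the loop starts
    view = memoryview(buf)
    total = 0
    # buffering=0 gives raw file objects, so each readinto()/write()
    # below corresponds to a read(2)/write(2) call.
    with open(src_path, "rb", buffering=0) as src, \
            open(dst_path, "wb", buffering=0) as dst:
        while True:
            n = src.readinto(buf)  # clobbers the same buffer every pass
            if not n:
                break  # 0 bytes means EOF
            written = 0
            while written < n:
                # Handle partial writes by resuming where we left off.
                written += dst.write(view[written:n])
            total += n
    return total
```

Nothing in the loop touches the address space: no mapping is created or torn down, just the same buffer filled and drained until EOF.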
Measure Before Blogging
Well, not exactly me. I don’t have much Solaris gear around here, so I asked Padraig O’Sullivan to compare the stock cp(1) of Solaris 10 to a GNU cp(1) with the modification I discuss in this blog entry. The goal was to test whether the stock cp(1), constantly mapping and unmapping the input file, is somehow faster or gentler on processor cycles than starting out with a heap buffer and reusing it. The latter is exactly what GNU cp(1) does, of course. Padraig asked:
One question about the benchmark you want me to run (I want to make sure I get all the data you want) – is this strategy ok?
1. Mount a UFS filesystem with the forcedirectio option
2. Create a 1 GB file on this filesystem
3. Copy the file using the standard cp(1) utility and record timing statistics
4. Copy the file using your modified cp8M utility and record timing statistics
Let me know if this is ok or if you want more information for the benchmark.
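Steps 3 and 4 boil down to recording wall clock, user and kernel mode time for each copy, which is what time does in the shell. A minimal sketch of the same measurement follows. The time_command helper is hypothetical, and mounting the forcedirectio filesystem and creating the file are left out.

```python
import resource
import subprocess
import time


def time_command(argv):
    """Run argv and return (wall, user, sys) in seconds for the child."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(argv, check=True)  # waits for the child to finish
    wall = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    user = after.ru_utime - before.ru_utime
    kernel = after.ru_stime - before.ru_stime
    return wall, user, kernel
```

For example, time_command(["/usr/bin/cp", "large_file", "large_file.1"]) would cover step 3. The kernel mode figure is the one that matters most here, since that is where mmap() overhead would show up.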
There was something else. I wanted a fresh system reboot just prior to each copy operation to make sure there were no variables. Padraig had the following to report:
[…] manage to run the benchmark in the manner you requested this morning […] Below is the results. I rebooted the machine before performing the copy each time.
# ls -l large_file
-rw-r--r-- 1 root root 1048576000 Mar 5 10:25 large_file
# time /usr/bin/cp large_file large_file.1
# time /usr/bin/cp8m large_file large_file.2
Look, I’m just an old Sequent hack and yet the results didn’t surprise me. Throughput increased roughly 16%, from 7.3MB/s to 8.5MB/s. What about resources? The tweaked GNU cp8M used roughly 26% fewer kernel mode processor cycles to do the same task. That’s not trivial since we didn’t actually eliminate any I/O. What’s that? Yes, cp8M copies a 1000MB file with roughly 16% more throughput, which at those rates works out to about 14% less wall clock time, without eliminating any physical I/O!
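A quick back-of-the-envelope check shows that a throughput gain and a wall clock reduction are not the same percentage. It uses only the two rounded throughput figures quoted above, not the raw timings, so treat the derived numbers as approximate.

```python
def percent_change(old, new):
    """Relative change from old to new, as a percentage."""
    return (new - old) / old * 100.0


stock_rate = 7.3   # MB/s, stock Solaris cp(1), as quoted above
cp8m_rate = 8.5    # MB/s, cp8M, as quoted above
file_mb = 1000.0   # size of the test file in MB

throughput_gain = percent_change(stock_rate, cp8m_rate)

# Elapsed times implied by the rounded rates (not measured values).
stock_secs = file_mb / stock_rate
cp8m_secs = file_mb / cp8m_rate
time_saved = -percent_change(stock_secs, cp8m_secs)

print(f"throughput gain: {throughput_gain:.1f}%")
print(f"implied elapsed: {stock_secs:.0f}s vs {cp8m_secs:.0f}s")
print(f"time saved:      {time_saved:.1f}%")
```

The gap between the two percentages is just arithmetic: going from 7.3 to 8.5 is a bigger step relative to 7.3 than the matching drop in elapsed time is relative to the longer run.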
Yes, blogging is supposed to be a bit controversial. It cultivates critical thought and the occasional gadfly, fly-by commentary I so dearly appreciate. Here we have a simple test that shows a “normal” cp(1) is slightly faster and significantly lighter on resources than the magical mmap()-enabled cp(1) that ships with Solaris. Does that mean there isn’t a case where the mmap() style is better? I don’t know. It could be that some really large Solaris box with dozens of concurrent cp(1) operations would show the value of the mmap() approach. If so, let us know.
I hope someone will volunteer to test cp8M on a high-end Solaris system. Maybe the results will help us understand why cp(1) uses mmap() on Solaris for its input file. Maybe not. Guess which way I’d wager.