If Sir Isaac Newton was walking about today dropping apples to prove his theory of gravity, he’d feel about like I do making this blog entry. The topic? Concurrent writes on file system files with Direct I/O.
A couple of months back, I made a blog entry about BIGFILE tablespaces in ASM versus modern file systems.The controversy at hand at the time was about the dreadful OS locking overhead that must surely be associated with using large files in a file system. I spent a good deal of time tending to that blog entry pointing out that the world is no longer flat and such age-old concerns over OS locking overhead on modern file systems no longer relevant. Modern file systems support Direct I/O and one of the subtleties that seems to have been lost in the definition of Direct I/O is the elimination of the write-ordering locks that are required for regular file system access. The serialization is normally required so that if two processes should write to the same offset in the same file, one entire write must occur before the other—thus preventing fractured writes. With databases like Oracle, no two processes will write to the same offset in the same file at the same time. So why have the OS impose such locking? It doesn’t with modern file systems that support Direct I/O.
In regards to the blog entry called ASM is “not really an optional extra” With BIGFILE Tablespaces, a reader posted the following comment:
“node locks are only an issue when file metadata changes”
This is the first time I’ve heard this. I’ve had a quick scout around various sources, and I can’t find support for this statement.
All the notes on the subject that I can find show that inode/POSIX locks are also used for controlling the order of writes and the consistency of reads. Which makes sense to me….
Am I misunderstanding something?
And my reply:
…in short, yes. When I contrast ASM to a file system, I only include direct I/O file systems. The number of file systems and file system options that have eliminated the write-ordering locks is a very long list starting, in my experience, with direct async I/O on Sequent UFS as far back as 1991 and continuing with VxFS with Quick I/O, VxFS with ODM, PolyServe PSFS (with the DBOptimized mount option), Solaris UFS post Sol8-U3 with the forcedirectio mount option and others I’m sure. Databases do their own serialization so the file system doing so is not needed.
The ixora and solarisinternals references are very old (2001/2002). As I said, Solaris 8U3 direct I/O completely eliminates write-ordering locks. Further, Steve Adams also points out that Solaris 8U3 and Quick I/O where the only ones they were aware of, but that doesn’t mean VxFS ODM (2001), Sequent UFS (starting in 1992) and ptx/EFS, and PolyServe PSFS (2002) weren’t all supporting completely unencumbered concurrent writes.
Ari, thanks for reading and thanks for bringing these old links to my attention. Steve is a fellow Oaktable Network Member…I’ll have to let him know about this out of date stuff.
There is way too much old (and incomplete) information out there.
A Quick Test Case to Prove the Point
The following screen shot shows a shell process on one of my Proliant DL585s with Linux RHEL 4 and the PolyServe Database Utility for Oracle. The session is using the PolyServe PSFS filesystem mounted with the DBOptimized mount option which supports Direct I/O. The test consists of a single dd(1) process overwriting the first 8GB of a file that is a little over 16GB. The first invocation of dd(1) writes 2097152 4KB blocks in 283 seconds for an I/O rate of 7,410 writes per second. The next test consisted of executing 2 concurrent dd(1) processes each writing a 4GB portion of the file. Bear in mind that the age old, decrepit write-ordering locks of yester-year serialized writes. Without bypassing those write locks, two concurrent write-intensive processes cannot scale their writes on a single file. The screen shot shows that the concurrent write test achieved 12,633 writes per second. Although 12,633 represents only 85% scale-up, remember, these are physical I/Os—I have a lot of lab gear, but I’d have to look around for a LUN that can do more than 12,633 IOps and I wanted to belt out this post. The point is that on a “normal” file system, the second go around of foo.sh with two dd(1) processes would take the same amount of time to complete as the single dd(1) run. Why? Because both tests have the same amount of write payload and if the second foo.sh suffered serialization the completion times would be the same: