ASM is “not really an optional extra” With BIGFILE Tablespaces

Published October 30, 2006 oracle 24 Comments

I was reading Chris Foot’s (very good) blog entry about BIGFILE Tablespaces when I choked on the following text (mind you, Foot is quoting someone else):

…ASM is not only recommended, but a requirement because of inode issues: “If you create a truly large bigfile tablespace on a traditional file system, you will suffer from horrendous inode locking issues. That is why ASM is not really an optional extra with these things.”

This is a great topic as I too feel BIGFILE tablespaces are a great feature. I have to point out that the bit about inode locking is a red herring.

Folks, inode locks are only an issue when file metadata changes. Fact is, when you use file system files for Oracle, metadata doesn’t change; file contents change. If you deploy on a file system that supports direct I/O, one benefit you get is elimination of atime/mtime updates. These are the only metadata that would get changed on every file access (doom!). With mtime/atime updates removed from the direct I/O codepath, the only metadata changes left are structural changes to the file, again, not the file contents. That is, if you create, remove, truncate or extend a file (even with direct I/O), then, yes of course, inode locks must be held in an elevated mode. In the case of a cluster file system, that extends to a cluster-inode lock. Now, if your cluster file system happens to use some rudimentary central lock approach, as opposed to a symmetric/distributed (DLM) approach, then there are issues at that level, but only when the file changes (again, not the file contents).
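To make the contents-versus-structure distinction concrete, here is a minimal sketch. It assumes a POSIX system and uses ordinary buffered I/O, not O_DIRECT, so it says nothing about timestamps or about which locks the kernel takes. It only shows that overwriting a block inside an already-sized file leaves the file's size and inode number alone, while an extend changes the size. The names `stat_snapshot` and `demo` are made up for illustration.

```python
import os
import tempfile


def stat_snapshot(path):
    """Return the structural attributes this sketch cares about."""
    st = os.stat(path)
    return {"inode": st.st_ino, "size": st.st_size}


def demo():
    fd, path = tempfile.mkstemp()
    try:
        # Size the file once up front (a structural change, done one time).
        os.ftruncate(fd, 1 << 20)  # 1 MiB
        before = stat_snapshot(path)

        # Overwrite one 8 KiB block in place: contents change, structure does not.
        os.pwrite(fd, b"x" * 8192, 64 * 8192)
        after_overwrite = stat_snapshot(path)

        # Extend the file: this is the kind of change that alters metadata.
        os.ftruncate(fd, 2 << 20)  # 2 MiB
        after_extend = stat_snapshot(path)

        return before, after_overwrite, after_extend
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    for label, snap in zip(("before", "after overwrite", "after extend"), demo()):
        print(label, snap)
```

The first two snapshots come out identical; only the third differs, and only in size.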

The point being that unless you are extending a BIGFILE tablespace on a freakishly frequent basis, the inode thing is a red herring.
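How frequent is "freakishly frequent"? A rough back-of-the-envelope sketch, with purely hypothetical sizes and increments (none of these numbers come from Oracle defaults), counts how many extend operations a datafile goes through on its way from an initial size to a final size:

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
TIB = 1024 ** 4


def extend_operations(initial_bytes, final_bytes, increment_bytes):
    """Count the fixed-size extends needed to grow from initial to final size."""
    if increment_bytes <= 0:
        raise ValueError("increment must be positive")
    growth = max(0, final_bytes - initial_bytes)
    # Ceiling division using integers only, to avoid float rounding.
    return -(-growth // increment_bytes)


if __name__ == "__main__":
    # Hypothetical example: a 100 MiB file growing to 1 TiB.
    print(extend_operations(100 * MIB, 1 * TIB, 1 * MIB))  # tiny increment
    print(extend_operations(100 * MIB, 1 * TIB, 1 * GIB))  # larger increment
```

With these made-up figures the tiny increment needs about a thousand times as many structural changes as the larger one, which is the part of the equation a sensible tablespace creation procedure controls.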

I really do hate it when non-issues like this are used to prop a technology choice such as ASM.

24 Responses to “ASM is “not really an optional extra” With BIGFILE Tablespaces”


1 Noons October 30, 2006 at 10:34 pm

Please correct me if I’m wrong Kevin:
I do recall the inode may be used as well during random file access for sufficiently large files?

As in:
If the file is sufficiently large to warrant more than one inode – for example when using indirect blocks and/or double indirect blocks – then any random disk access past the addressing capacity of a single inode will forcibly use more than one of the beasties? Assuming of course traditional file system setup: not “largefiles” or some other workaround for larger inodes.

Not that this would impose a load on the inode lock, I suppose – until an indirection split happened. Even then, not really significant. (Really digging into very early unix stuff here…)

I wonder: O_DIRECT io *still* must use the inode structures to address individual disk blocks, doesn’t it? All it’s really doing is bypassing the processing in the OS buffer cache?
Only raw disk io bypasses all the Unix file system paraphernalia and simply addresses the disk as a very long vector of 512-byte disk blocks. Correct, or did I forget something important here?

Anyways, as you say: nothing to do with ASM use, of course. Just something I wanted to clarify and you’re the right person to ask.
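The indirection described in this comment can be sketched for a textbook block-pointer layout. The parameters are assumptions chosen for illustration (12 direct pointers, 4-byte block pointers, 8 KiB blocks, up to triple indirection) and real file systems differ, especially extent-based ones. The function name `indirection_level` is hypothetical.

```python
def indirection_level(offset, block_size=8192, direct=12, ptr_size=4):
    """Return how many levels of indirect blocks sit between the inode and
    the data block holding `offset` (0 means a direct pointer)."""
    ptrs_per_block = block_size // ptr_size
    block = offset // block_size

    if block < direct:
        return 0
    block -= direct

    # Each level of indirection multiplies the reachable block count.
    for level in (1, 2, 3):
        span = ptrs_per_block ** level
        if block < span:
            return level
        block -= span

    raise ValueError("offset is beyond the triple-indirect range")


if __name__ == "__main__":
    for off in (0, 12 * 8192, (12 + 2048) * 8192):
        print(off, indirection_level(off))
```

Walking these pointers is a read of the file's block map. Nothing in the lookup modifies the map, which is consistent with the point made in the reply that follows.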

2 Kevin October 30, 2006 at 11:07 pm

Hi Noons,

Thanks for the comment. So the reason I went on the rant to start with was the comment about “inode locking” in the blog entry I referred to. You are right that extremely large files have to incur indirection of one form or the other depending on how the blocks are associated with files (e.g., lists, hash, b*tree, extents, etc)–which differs based on implementation. I think the best way to address your comment is to point out that no indirect portion of a file in a filesystem can be sliced off without holding an exclusive lock on the file inode. So it isn’t as though locking overhead increases as files get larger. The structure gets more complex (depending on implementation), but having a plurality of inodes associated with a large file doesn’t inherently cause any sort of locking storm. At the end of the day, __changes__ to the file structure are what invoke locking overhead…accesses to file contents, on the other hand, do not cause a locking storm.

Now, lest someone point out that even reference counts on inodes (incurred when a file is simply opened) are protected by locks, I’ll point out that bumping the ref count on an inode does not cross locking paths with I/O in flight. In fact, not much does, really. And, modern locking primitives (reader/writer locks, etc) make this sort of lock very efficient (so opening files for instance is really cheap).

I hope I answered your question. I’m glad you are a reader too, btw.

3 Alex Gorbachev November 2, 2006 at 4:08 am

Well, I saw ASM environments configured with ASM, OMF, bigfile tablespaces, and creating a tablespace there is as simple as “CREATE TABLESPACE TS1;”
This probably sets a very small default value for the next chunk and might, perhaps, cause the issues you mentioned.
I agree that it should be fixed by proper tablespace creation procedures rather than be an argument to choose certain technology.

4 Howard Rogers November 3, 2006 at 6:20 am

Well, since it was my quote that Chris quoted that you quoted and distorted, I’d like to set the record straight.

I never said, as per your blog title, that BIGFILEs “require” ASM. The clue was in the phrase “ASM is not really an optional extra with these things”, and I draw your attention to the word “really” in that sentence.

In case the subtlety was missed, the point was that though there is no technical reason why BIGFILEs cannot be used in non-ASM environments, the two complement each other nicely, and I still maintain you would be mad to think of implementing the one without the other. Or that you would be even contemplating one unless the other (or something equivalent) was being contemplated too.

I believe (it was a while ago now, so I can’t be sure) that I wrote the original comments in any case by way of rejoinder to Chris’s earlier enthusiasm for bigfiles. I was merely pointing out that bigfiles have issues, and that inodes are one of them, as even you have acknowledged in your original post and your comments in reply to Noons; and I was pointing that out to try to dampen Chris’s apparent enthusiasm for bigfiles, because I didn’t want his readers thinking they were the best thing since sliced bread.

If you truly need bigfiles, I’ll lay odds you’ve already implemented ASM or some other advanced storage solution. What I don’t want to see anyone doing is plonking bigfiles down on NTFS or ext3, because if that’s all they’re using for storage, they don’t need bigfiles in the first place.

Of course, if you discard context, quote a quote of a quote, and then re-write it into a headline containing something that was never actually said, you can get good mileage out of it, but it hardly warrants you choking on anything very much.

5 Kevin November 3, 2006 at 4:12 pm

Well, since it was my quote that Chris quoted that you quoted and distorted, I’d like to set the record straight.

…I do apologize for attributing to you this notion that ASM is a requirement for BIGFILE tablespaces.

I never said, as per your blog title, that BIGFILEs “require” ASM. The clue was in the phrase “ASM is not really an optional extra with these things”, and I draw your attention to the word “really” in that sentence.

…The title of the blog entry was not a quote.

In case the subtlety was missed, the point was that though there is no technical reason why BIGFILEs cannot be used in non-ASM environments, the two complement each other nicely, and I still maintain you would be mad to think of implementing the one without the other. Or that you would be even contemplating one unless the other (or something equivalent) was being contemplated too.
^^^^^^^^^^^^^^^^^^^^^^^

…The subtlety was missed, because I read the original context at http://my.opera.com/dizwell/blog/show.dml/40242 where you state matter-of-factly that single file contention will be suffered with BIGFILE due to “inode locking”. And I pointed out in my blog that inode locking is NOT an issue unless the file attributes are changing–which they don’t when Oracle uses direct I/O. So, no, you didn’t say ASM was the only way to go; the quoted article by Chris Foot did, and that was what I was quoting.

…I think what I understand you are trying to say is that BIGFILE shouldn’t be used on a junk filesystem that forces POSIX write-ordering locks, maintains mtime/atime updates and doesn’t support online resizing. I couldn’t agree more and that was the point I was making in my blog entry.

I believe (it was a while ago now, so I can’t be sure) that I wrote the original comments in any case by way of rejoinder to Chris’s earlier enthusiasm for bigfiles. I was merely pointing out that bigfiles have issues, and that inodes are one of them, as even you have acknowledged in your original post and your comments in reply to Noons;

…No, I don’t (and didn’t) acknowledge that “inodes” are one of the “big issues” in my follow up with Noons, because they are not.

…I’ll give you this, if you implement BIGFILE without direct I/O on a traditional Berkeley 8K-block UFS, you will be hating life because that environment would not be able to grow to accommodate the BIGFILE single datafile autoextend, and indeed you will see contention in the I/O path because of forced write ordering and other non-O_DIRECT monkey business.

If you truly need bigfiles, I’ll lay odds you’ve already implemented ASM or some other advanced storage solution.
^^^^^^^^^^^^^^^^^^^^^^^^^

…right, we see eye to eye

What I don’t want to see anyone doing is plonking bigfiles down on NTFS or ext3, because if that’s all they’re using for storage, they don’t need bigfiles in the first place.

…Agreed

Of course, if you discard context, quote a quote of a quote, and then re-write it into a headline containing something that was never actually said

…I hope my explanation herein will suffice. There was no ill intent.

6 Howard Rogers November 5, 2006 at 3:27 am

Intent doesn’t come into it. I’m sure Hitler never intended the Second World War, either… just hoped he’d get Poland for free as he had previously won Czechoslovakia.

Regardless of intent, the fact remains: I never said that bigfiles required ASM. I mentioned inode locking as ONE (not a “big”) issue, and you here acknowledge that on traditional file systems, inode locking CAN be an issue. And there is nothing “matter of factly” about eight words in the blog entry you now reference that mentioned inode locking entirely in passing whilst actually discussing parallel performance in bigfile temporary tablespaces.

If, indeed, the point you were seeking to make in your blog entry was that “BIGFILE shouldn’t be used on a junk filesystem”, then you could have said that and spared us the innuendo.

Over and out.

7 kevinclosson November 5, 2006 at 6:38 pm

I changed the title to accurately reflect what was quoted. Regardless of who I was quoting, I still think the phrase “ASM is required” conveys the same point as “ASM…not really an optional extra”.

Getting back to the point, inode locking is not an issue with filesystem BIGFILE tablespaces because when Oracle uses direct I/O, file metadata does not change.

8 Alex Gorbachev November 6, 2006 at 2:31 am

Well, I think that Kevin’s idea was to criticize the marketing campaign of ASM, as an example, and how it uses any true or false rumors and out-of-context quotes for its propaganda. Unintentionally, we often overestimate the impact of something in favor of a new super popular technology. That is indeed a good example of a someone-said-that-someone-said… approach, with the subtlety missed in the end.
Obviously to me, the intent wasn’t to criticize someone personally (on the contrary, the referenced post was mentioned as “very good”); comparing that to Hitler is a bit excessive, I believe.

9 Howard Rogers November 12, 2006 at 7:50 am

Don’t be silly, Alex. I didn’t compare anyone or anything to Hitler. As an analogy only, what I said was that what one intends can be very different from what one achieves… and that therefore Kevin’s protestations of “no ill intent” were meaningless.

And Kevin is still twisting stuff. To say that “ASM is required” is the same thing as “ASM is not really an optional extra” is to distort, twist and strip out bucket-loads of context. It’s intellectually dishonest, actually.

And getting back to the point. Inode locking IS an issue if a bigfile tablespace keeps extending, as Kevin himself acknowledges. Guess what most people implementing bigfiles will NOT do? Hint: they won’t create the thing at its final, big, multi-terabyte size. So guess what bigfiles will tend to do a lot of? Hint: a lot of autoextending. Guess what is therefore going to be an issue: inode locking.

Kevin clearly has an agenda with ASM, which is fine. He’s entitled to hate it, think it irrelevant, think it a marketing ploy in search of a solution, but I personally don’t think that distortion, misquotation and construction of strawmen arguments helps clarify the issues.

As I said last time: if he wants to state that bigfiles and ‘ordinary’ file systems are a bad partnership, I will be in the vanguard of that particular cause. If he wants to trash ASM on all sorts of technical and marketing grounds, he and I will part company, but I respect his right to do so and the intellectual basis on which he takes the line that he does.

10 kevinclosson November 12, 2006 at 4:55 pm

…there…free speech…even on my blog.

Howard Rogers wrote:

“And getting back to the point. Inode locking IS an issue if a bigfile tablespace keeps extending, as Kevin himself acknowledges. Guess what most people implementing bigfiles will NOT do? Hint: they won’t create the thing at its final, big, multi-terabyte size. So guess what bigfiles will tend to do a lot of? Hint: a lot of autoextending. Guess what is therefore going to be an issue: inode locking.”

….So is it intellectually dishonest to interpret this as meaning you think that autoextending tablespaces is somehow inexpensive if stored in ASM? You know quite well that frequent autoextend is poison; layering inode contention on top of something that is terribly expensive is noise.

…Is this thread over yet?

11 Howard Rogers November 13, 2006 at 3:37 am

Do you want it to be over?

Yes, I do happen to believe that autoextending is not a poison when it’s done in ASM. There’s a reason OMF made autoextension the default, and ASM is it, and I say that as one who went nuts with anybody daring to switch on autoextension for anything in versions 7, 8, 8i and 9i.

Now: instead of the nudge-nudge-wink-wink school of ripping misquotes out of context, if you’ve got something that demonstrates that belief to be misplaced, I’m all ears. But now we’re having a discussion about the merits or otherwise of ASM, not of the subtleties of bigfile tablespace implementation… which is fine, but just so you know.

12 kevinclosson November 13, 2006 at 1:14 pm

I’ll allow the thread to continue

13 Ari January 17, 2007 at 9:32 pm

“inode locks are only an issue when file metadata changes”
This is the first time I’ve heard this. I’ve had a quick scout around various sources, and I can’t find support for this statement.
All the notes on the subject that I can find show that inode/POSIX locks are also used for controlling the order of writes and the consistency of reads. Which makes sense to me….

Refer to:
http://www.ixora.com.au/notes/inode_locks.htm

Sec 5.4.4 of
http://www.phptr.com/articles/article.asp?p=606585&seqNum=4&rl=1

Sec 2.4.5 of
oracle_fsperf.pdf

Table 15.2 of
http://www.informit.com/articles/article.asp?p=605371&seqNum=6&rl=1

Am I misunderstanding something?

14 kevinclosson January 17, 2007 at 11:27 pm

>Am I misunderstanding something?

…in short, yes. When I contrast ASM to a filesystem, I only include direct I/O filesystems. The number of filesystems and filesystem options that have eliminated the write-ordering locks is a very long list starting, in my experience, with direct async I/O on Sequent UFS as far back as 1991 and continuing with VxFS with Quick I/O, VxFS with ODM, PolyServe PSFS with the DBOptimized mount option, Solaris UFS post Sol8-U3 with the forcedirectio mount option and others I’m sure. Databases do their own serialization so the filesystem doing so is not needed.

The ixora and solarisinternals references are very old (2001/2002). As I said, Solaris 8U3 direct I/O completely eliminates write-ordering locks. Further, Steve Adams also points out that Solaris 8U3 and Quick I/O were the only ones they were aware of, but that doesn’t mean VxFS ODM (2001), Sequent UFS (starting in 1992) and ptx/EFS, and PolyServe PSFS (2002) weren’t all supporting completely unencumbered concurrent writes.

Ari, thanks for reading and thanks for bringing these old links to my attention. Steve is a fellow Oaktable Network Member…I’ll have to let him know about this out of date stuff.

Way too much old (and incomplete) information out there.

15 kevinclosson January 17, 2007 at 11:31 pm

ugh, maybe not today:
This is the Postfix program at host deuteronomy.ixora.com.au.

I’m sorry to have to inform you that your message could not be delivered to one or more recipients. It’s attached below.

For further assistance, please send mail to

If you do so, please include this problem report. You can delete your own text from the attached returned message.

The Postfix program

(expanded from ): can’t create
user output file. Command output: procmail: Error while writing to
“/var/mail/steve”

16 Hans-Peter Sloot January 22, 2007 at 9:16 am

Hello Kevin,

Howard Rogers’ point about inodes may not be true, but in my opinion the statement in itself is true.
For the past months I was involved in a data warehouse project.

For bigfile tablespaces you need large logical volumes.
At least in my environment (Solaris 5.10), where the system administrators created ufs filesystems, a drop tablespace caused problems to the filesystem.
When dropping for example 100G the file system would be very slow for at least 15 minutes.
A df -k command showed very strange data (like a percentage with 8 digits) after taking minutes to complete.

Therefore I had to switch to ASM.
I have to drop and recreate tablespaces due to bug 5569640 every day.
With ASM this is no problem whereas it was with normal filesystems.

For the flash recovery area I needed a +9T file system.
This has nothing to do with bigfiles but dropping backup sets etc would make the filesystem fairly unusable.

Regards Hans-Peter

17 kevinclosson January 22, 2007 at 3:51 pm

Hello Hans-Peter,

Thank you for that feedback. You point out one very interesting aspect of ASM and that is, it is the same on all platforms. You are right that differing filesystem implementations can show weaknesses with certain operations. I have no experience really with UFS on Sol. My background is largely Sequent filesystems (e.g., UFS and EFS), Veritas and PolyServe. Sequent UFS was a lot different than that of Sol.

I think you’ll notice that my blogging is very anti techno-religion and to that point it would be absurd for me to be against ASM when it clearly solved a problem for you. What I stand against is in fact the opposite, which is this bizarre mindset that ASM is better than any other choice–and at this point it isn’t.

Thanks for visiting my blog.

18 Mark Callaghan June 22, 2011 at 12:33 am

ext-2 and ext-3 on Linux hold the inode mutex during a write. ext-4 might also hold it. Metadata changes are not required.

19 kevinclosson June 22, 2011 at 1:42 am

Hi Mark,

You are right but not all locks are the kiss of death. The only thing that matters (in my/our) world is whether any such lock is acquired in exclusive mode thus breaking concurrent I/O when in the O_DIRECT code path. If you have a file system that requires such exclusion I’d recommend throwing it in the trash because it is about 15 years beyond shelf-life.

I’m glad the kernel locks files when an I/O is in flight. It wouldn’t be all that fun if the file disappeared from the file system (completely not just namespace) right from underneath a direct I/O.

Your comment is welcome. Am I missing something more substantial in your addition to this thread?

20 Mark Callaghan June 22, 2011 at 2:00 am

From testing I did in 2007 and reading some of the source, the inode lock taken by ext-2/ext-3 is exclusive for the duration of a write. I was surprised to learn this as it isn’t well documented. xfs on Linux does not do this and is a much better choice for MySQL on Linux. I would love to learn I was wrong or this has changed.

- 21 kevinclosson June 22, 2011 at 3:48 pm
  
  Mark,
  
  Are you *sure* that this testing you did in 2007 included O_DIRECT? Was this a circa 2007 test of RH 2.1? RHEL 3/4/5? If not O_DIRECT, a serialized write probably wouldn’t affect throughput as much as CPU overhead because (unless a SYNC flag is thrown in) a non-O_DIRECT write is nothing but a kernel memcpy from user buffer to page cache. A serializing lock covering a memcpy burns CPU, but shouldn’t affect throughput all that much (unless you are close to saturated from other aspects of the workload). I cannot remember any Linux file system that serialized O_DIRECT writes, but then I was involved with hard-wired cluster file system technology from 2001 to 2007 (PolyServe). PSFS was not stupid, so it didn’t serialize O_DIRECT writes. There is no reason under the sun to serialize O_DIRECT writes. If you find yourself on a file system that serializes O_DIRECT writes, move off of it.
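The access pattern under discussion (several writers, one file, disjoint blocks, no file extension) can be sketched as follows. This is a correctness illustration only, assuming a POSIX system. It uses buffered `os.pwrite` rather than O_DIRECT, it is not a benchmark, and it cannot show whether a given file system serializes the writes internally. The function name and the block and writer counts are hypothetical.

```python
import os
import tempfile
import threading

BLOCK = 8192  # assumed block size, for illustration only


def concurrent_block_writes(n_writers=4, blocks_per_writer=16):
    """Have several threads overwrite disjoint block ranges of one file and
    return the first byte of every block as read back afterwards."""
    fd, path = tempfile.mkstemp()
    try:
        total = n_writers * blocks_per_writer
        # Size the file once so that no writer ever has to extend it.
        os.ftruncate(fd, total * BLOCK)

        def writer(i):
            payload = bytes([i + 1]) * BLOCK  # writer i stamps value i + 1
            for b in range(blocks_per_writer):
                block_no = i * blocks_per_writer + b
                os.pwrite(fd, payload, block_no * BLOCK)

        threads = [threading.Thread(target=writer, args=(i,))
                   for i in range(n_writers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

        with open(path, "rb") as f:
            data = f.read()
        return [data[k * BLOCK] for k in range(total)]
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    print(concurrent_block_writes())
```

Because each writer supplies its own offset via `pwrite`, no shared file position is involved and the application needs no ordering between writers. A real test of the claim in this thread would open the file with O_DIRECT, use suitably aligned buffers, and time the run on the file system in question.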
  
22 Mark Callaghan June 22, 2011 at 4:30 pm

I just repeated the results on CentOS 5.2 — https://www.facebook.com/notes/mark-callaghan/xfs-ext-and-per-inode-mutexes/10150210901610933

- 23 kevinclosson June 22, 2011 at 6:56 pm
  
  You did indeed reproduce the results. I’m thinking of my friend Dave Chinner when I say this, “Don’t use file systems that suck!”
  
  I’ve actually covered all this ground before. Bad file systems are bad file systems. But:
  
  https://kevinclosson.wordpress.com/2007/01/18/yes-direct-io-means-concurrent-writes-oracle-doesnt-need-write-ordering/
  
  I’ll just reiterate what I’ve been saying all along. The file systems I have experience with mate direct I/O with concurrent I/O. Of course, I “have experience” with ext3 but have always discounted ext variants for many reasons most importantly the fact that I spent 2001 through 2007 with clustered Linux…entirely. So there was no ext on my plate nor in my cross-hairs.
  
- 24 kevinclosson June 24, 2011 at 4:16 pm
  
  Hi Mark,
  
  You might see interesting results if you export that ext file system and mount it locally to get network loopback speed but with the NFS I/O code path and semantics. So long as you have enough nfsd processes you should see multiple-writer, single-file scalability. I recommend you google for the NFS mount options I’ve recommended for Oracle in past blog posts.
  
  This is not a mere hypothesis. I’d be interested to see if you get the same result as I do in this approach.
  


Kevin Closson's Blog: Platforms, Databases and Storage