
Little Things Doth Crabby Make – Part XVI (Addendum). Hey ls(1) And du(1) Are Supposed To Agree.

My last installment in the Little Things Doth Crabby Make series had a lot of readers stepping up to remind me that ls(1) and du(1) aren’t always supposed to report the same size-related information for files. Uh, I actually knew that!

The post wasn’t about sparse files or any other such remedial aspects of file sizes.

In the post I mentioned that I was taking some rather unseemly actions against my XFS file system.

One particular unseemly thing I did was the result of a bug in a small piece of my code. Imagine for a moment that the loff_t variable sz in the following snippet was stupidly uninitialized/unassigned and the program stepped on this syscall(__NR_fallocate,,,,) landmine.

 if ((ret = syscall(__NR_fallocate, fd, 0, (loff_t)0, (loff_t)sz)) != 0)
         perror("syscall.fallocate");

Well, if whatever happens to be stored in the variable sz is a really large value, you’ll have a.out (allocate_file in my case) spinning in kernel mode for the rest of your life (at least on a 2.6.18 kernel). However, I got tired of it shortly after I snapped the following top(1) information:

 
 top - 11:47:27 up 3 days, 17 min, 4 users, load average: 1.00, 1.00, 1.00
 Tasks: 481 total, 2 running, 479 sleeping, 0 stopped, 0 zombie
 Cpu(s): 0.0%us, 4.2%sy, 0.0%ni, 95.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
 Mem: 49451520k total, 4065088k used, 45386432k free, 121492k buffers
 Swap: 50339636k total, 1044k used, 50338592k free, 3609352k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
 12682 root 25 0 3648 308 248 R 99.7 0.0 880:23.09 allocate_file
 3997 root 15 0 13008 1416 816 R 1.0 0.0 10:25.16 top
 10100 gpadmin 15 0 111m 17m 2032 S 1.0 0.0 9:13.49 collectl
 1 root 15 0 10352 692 580 S 0.0 0.0 0:13.40 init
 2 root RT -5 0 0 0 S 0.0 0.0 0:00.10 migration/0
 3 root 34 19 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/0
 4 root RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/0
 5 root RT -5 0 0 0 S 0.0 0.0 0:00.10 migration/1
 6 root 34 19 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/1
 7 root RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/1
 8 root RT -5 0 0 0 S 0.0 0.0 0:00.21 migration/2
 9 root 34 19 0 0 0 S 0.0 0.0 0:00.08 ksoftirqd/2
 10 root RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/2
 11 root RT -5 0 0 0 S 0.0 0.0 0:04.91 migration/3
 12 root 34 19 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/3
 13 root RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/3
 14 root RT -5 0 0 0 S 0.0 0.0 0:00.09 migration/4

It turned out my stupid error put the file system up to the task of allocating nearly 14TB to a file in a file system with about 200GB free. My mistake. However, the call should have failed instead of leaving me with a kernel-mode process that required a server reset to clear. But, alas, I was using a very old interface. Had the particular test system I was investigating been running a more recent kernel, I would have called fallocate(2) and the situation would most likely have been different; this kernel, however, was older than the 2.6.23 minimum required for the fallocate(2) call.
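On a >= 2.6.23 kernel the same allocation could go through the glibc wrapper, with the guard my code lacked. The following is only a sketch under those assumptions; the preallocate name and the posix_fallocate fallback are mine, not code from allocate_file:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Preallocate sz bytes to path. Returns 0 on success, -1 on failure.
 * Unlike the buggy snippet above, the size is validated before the call,
 * and failures surface as errno values (e.g. ENOSPC) instead of a hang. */
int preallocate(const char *path, off_t sz)
{
    if (sz <= 0) {                                  /* the guard my code lacked */
        fprintf(stderr, "preallocate: bogus size %lld\n", (long long)sz);
        return -1;
    }

    int fd = open(path, O_CREAT | O_WRONLY, 0644);
    if (fd < 0) {
        perror("open");
        return -1;
    }

    int ret = fallocate(fd, 0, (off_t)0, sz);       /* glibc wrapper, not raw syscall() */
    if (ret != 0 && errno == EOPNOTSUPP)
        ret = posix_fallocate(fd, (off_t)0, sz) ? -1 : 0;  /* portable fallback */
    else if (ret != 0)
        fprintf(stderr, "fallocate: %s\n", strerror(errno));

    close(fd);
    return ret == 0 ? 0 : -1;
}
```

With a guard like that, an uninitialized (garbage) size is rejected before the kernel ever sees it, and an honest ENOSPC comes back instead of a runaway kernel-mode process.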

So what does this have to do with ls(1) and du(1)? Well, I had a lot of programs running that were thrashing the file system. I unearthed a race condition of some sort where my looping call to ls(1) managed to catch a glimpse of the file being populated by PID 12682 (see the top(1) output above). The ls(1) command reported zero bytes. The next line of the script executed microseconds (or less) later, at which point du(1) was of the opinion the file was 287GB. Both the initial and subsequent df(1) information was consistent. I haven’t studied the transactional nature of this old rendition of fallocate, so I can’t speculate about what was going on. The only thing executing on the system at the time was, indeed, several invocations of the allocate_file program. It turns out that none of them branched to that call with an uninitialized grenade—as it were.

I was unable to reproduce the situation and lost interest after fixing that stupid bug in the allocate_file program.

If there is any moral to this story it would be that the level of unpredictability is unpredictable when a process unpredictably asks the kernel to do something it cannot possibly do, such as allocating terabytes to a file in a file system with only gigabytes of free space. I would predict, however, that >2.6.23 fallocate() would handle my goofy mistake differently.

I hate it when I can’t reproduce a problem.

Little Things Doth Crabby Make – Part XVII. I See xfs_mkfile(8) Making Fragmented Files.

BLOG UPDATE 21-NOV-2011: The comment thread for this post is extremely relevant.

 

I recently had an “exchange of ideas” with an individual. It was this individual’s assertion that modern systems exhibit memory latencies measured in microseconds.

Since I haven’t worked on a system with microsecond-memory since late in the last millennium I sort of let the conversation languish.

The topic of systems speeds and feeds was fresh on my mind from that conversation when I encountered something that motivated me to produce this installment in the Little Things Doth Crabby Make series.

This installment in the series has to do with disk scan throughput and file system fragmentation. But what does that have to do with modern systems’ memory latency? Well, I’ll try to explain.

Even though I haven’t had the displeasure of dealing with microsecond memory this century, I do recall that such ancient systems were routinely fed (and swamped) by just a few hundred megabytes per second of disk scan throughput.

I try to keep things like that in perspective when I’m fretting over the loss of 126MB/s, as I was the other day, especially when that 126MB/s is a paltry 13% degradation in the systems I was analyzing! Modern systems are a modern marvel!

But what does any of that have to do with XFS and fragmentation? Please allow me to explain. I had a bit of testing going where 13% (for 126MB/s) did make me crabby (it’s Little Things Doth Crabby Make after all).

The synopsis of the test, and thus the central topic of this post, was:

  1. Create and initialize a 32GB file whilst the server is otherwise idle
  2. Flush the Linux page cache
  3. Use dd(1) to scan the file with 64KB reads — measure performance
  4. Use xfs_bmap(8) to report on file extent allocation and fragmentation

Step number 1 in the test varied the file creation/initialization method between the following three techniques/tools:

  1. xfs_mkfile(8)
  2. dd(1) with 1GB writes (yes, this works if you have sufficient memory)
  3. dd(1) with 64KB writes

The following screen-scrape shows that the xfs_mkfile(8) case rendered a file that delivered scan performance significantly worse than the two dd(1) cases. The degradation was 13%:

# xfs_mkfile 32g testfile
 # sync;sync;sync;echo "3" > /proc/sys/vm/drop_caches
 # dd if=testfile of=/dev/null bs=64k
 524288+0 records in
 524288+0 records out
 34359738368 bytes (34 GB) copied, 40.8091 seconds, 842 MB/s
 # xfs_bmap -v testfile > frag.xfs_mkfile.out 2>&1
 # rm -f testfile
 # dd if=/dev/zero of=testfile bs=1024M count=32
 32+0 records in
 32+0 records out
 34359738368 bytes (34 GB) copied, 22.1434 seconds, 1.6 GB/s
 # sync;sync;sync;echo "3" > /proc/sys/vm/drop_caches
 # dd if=testfile of=/dev/null bs=64k
 524288+0 records in
 524288+0 records out
 34359738368 bytes (34 GB) copied, 35.5057 seconds, 968 MB/s
 # xfs_bmap -v testfile > frag.ddLargeWrites.out 2>&1
 # rm testfile
 # df -h .
 Filesystem Size Used Avail Use% Mounted on
 /dev/sdb 2.7T 373G 2.4T 14% /data1
 # dd if=/dev/zero of=testfile bs=1M count=32678
 32678+0 records in
 32678+0 records out
 34265366528 bytes (34 GB) copied, 21.6339 seconds, 1.6 GB/s
 # sync;sync;sync;echo "3" > /proc/sys/vm/drop_caches
 # dd if=testfile of=/dev/null bs=64k
 522848+0 records in
 522848+0 records out
 34265366528 bytes (34 GB) copied, 35.3932 seconds, 968 MB/s
 # xfs_bmap -v testfile > frag.ddSmallWrites.out 2>&1

I was surprised by the xfs_mkfile(8) case. Let’s take a look at the xfs_bmap(8) output.

First, the two maps from the dd(1) files:

# cat frag.ddSmallWrites.out
 testfile:
 EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
 0: [0..9961471]: 1245119816..1255081287 6 (166187576..176149047) 9961472
 1: [9961472..26705919]: 1342791800..1359536247 7 (84037520..100781967) 16744448
 2: [26705920..43450367]: 1480316192..1497060639 8 (41739872..58484319) 16744448
 3: [43450368..66924543]: 1509826928..1533301103 8 (71250608..94724783) 23474176
 #
 # cat frag.ddLargeWrites.out
 testfile:
 EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
 0: [0..9928703]: 1245119816..1255048519 6 (166187576..176116279) 9928704
 1: [9928704..26673151]: 1342791800..1359536247 7 (84037520..100781967) 16744448
 2: [26673152..43417599]: 1480316192..1497060639 8 (41739872..58484319) 16744448
 3: [43417600..67108863]: 1509826928..1533518191 8 (71250608..94941871) 23691264

The mapping of file offsets to extents is quite close in the two dd(1) cases. Moreover, XFS gave me 4 extents for my 32GB file. I like that… but…

So what about the xfs_mkfile(8) case? Well, not so good.

I’ll post a blog update when I figure out more about what’s going on. In the meantime, I’ll just paste it and that will be the end of this post for the time being:

# cat frag.xfs_mkfile.out
testfile:
 EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
 0: [0..10239]: 719289592..719299831 4 (1432..11671) 10240
 1: [10240..14335]: 719300664..719304759 4 (12504..16599) 4096
 2: [14336..46591]: 719329072..719361327 4 (40912..73167) 32256
 3: [46592..78847]: 719361840..719394095 4 (73680..105935) 32256
 4: [78848..111103]: 719394608..719426863 4 (106448..138703) 32256
 5: [111104..143359]: 719427376..719459631 4 (139216..171471) 32256
 6: [143360..175615]: 719460144..719492399 4 (171984..204239) 32256
 7: [175616..207871]: 719492912..719525167 4 (204752..237007) 32256
 8: [207872..240127]: 719525680..719557935 4 (237520..269775) 32256
 [...3,964 lines deleted...]
 3972: [51041280..51073535]: 1115787376..1115819631 6 (36855136..36887391) 32256
 3973: [51073536..51083775]: 1115842464..1115852703 6 (36910224..36920463) 10240
 3974: [51083776..51116031]: 1115852912..1115885167 6 (36920672..36952927) 32256
 3975: [51116032..54897663]: 1142259368..1146040999 6 (63327128..67108759) 3781632
 3976: [54897664..55078911]: 1146077440..1146258687 6 (67145200..67326447) 181248
 3977: [55078912..56094207]: 1195607400..1196622695 6 (116675160..117690455) 1015296
 3978: [56094208..67108863]: 1245119816..1256134471 6 (166187576..177202231) 11014656

Oracle Exadata Database Machine Handily Handles The Largest Database In Oracle IT. What Does That Really Mean?

In my recent post entitled Oracle Executives Underestimate SPARC SuperCluster I/O Capability–By More Than 90 Percent! I offered some critical thinking regarding the nonsensical performance claims attributed to the Sun SPARC SuperCluster T4 in one of the keynotes at Oracle Openworld 2011. In the comment thread of that post a reader asks:

All – has anyone measured the actual IOPS from disk as well as from flash in your Exadata (production) environment and compare with what the Oracle white paper or CXO presentations claimed?

That is a good question. It turns out the reader is in luck. There happens to be really interesting public information that can answer his question. According to this searchoracle.techtarget.com article, Campbell Webb, Oracle’s VP of Product Development IT, refers to Oracle’s Beehive email and collaboration database as “Oracle’s largest backend database.” Elsewhere in the article, the author writes:

Oracle’s largest in-house database is a 101-terabyte database running Beehive, the company’s in-house email and collaboration software, running on nine Oracle Exadata boxes.

As it turns out, Oracle’s Campbell Webb delivered a presentation on the Beehive system at Oracle Openworld 2011.  The presentation (PDF) can be found here. I’ll focus on some screenshots of that PDF to finish out this post.

According to the following slide from the presentation, we glean the following facts about Oracle’s Beehive database:

  • The Beehive database is approximately 93TB
  • Redo generation peaks at 20MB/s—although it is unclear whether that is per instance or the aggregate of all instances of this Real Application Clusters database
  • User SQL executions peak at roughly 30,000 per second

The techtarget.com piece quotes Campbell Webb as stating the configuration is 9 racks of Exadata gear with 24 instances of the database—but “only” 16 are currently active. That is a lot of Oracle instances and, indeed, a lot of instances can drive a great deal of physical I/O. Simply put, a 9-rack Exadata system is gargantuan.

The following is a zoom-in photo of slide 12 from the presentation. It spells out that the configuration has the standard 14 Exadata Storage Servers per rack (126 / 14 == 9) and that the hosts are X2-2 models. In a standard configuration there would be 72 database hosts in a 9-rack X2-2 configuration, but the techtarget.com article quotes Webb as stating 16 are active and there are only 24 in total. More on that later.

With this much gear we should expect astounding database throughput statistics. That turns out not to be the case. The following slide shows:

  • 4,000,000 logical I/O per second at peak utilization. That’s 250,000 db block gets + db block consistent gets (cache-buffers chain walks) per second per active host (16 hosts). That’s a good rate of SGA buffer pool cache activity—but not a crushing load for 2S Westmere EP.
  • The physical read to write ratio is 88:12.
  • Multiblock physical I/Os are fulfilled by Exadata Storage Servers in 6 milliseconds or less, on average
  • Single block reads are largely satisfied in Exadata Smart Flash Cache as is evidenced by the 1ms waits
  • Finally, database physical I/O peaks at 176,000 per second

176,000 IOPS
With 126 storage servers there is roughly 47TB of Exadata Smart Flash Cache. Considering the service times for single block reads there is clear evidence that the cache management is keeping the right data in the cache. That’s a good thing.

On the other hand, I see a cluster of 16 2U dual-socket Westmere-EP Real Application Clusters servers driving peak IOPS of 176,000. Someone please poke me with a stick because I’m bored to death—falling asleep. Nine racks of Exadata is capacity for 13,500,000 IOPS (read operations only of course). This database is driving 1% of that. 

Nine racks of Exadata should have 72 database hosts. I understand not racking them if you don’t need them, but the configuration is using fewer than 2 active hosts per rack—though, yes, there are 24 cabled (fewer than 3 per rack). Leaving out 48 X2-2 hosts is 96U—more than a full rack worth of aggregate wasted space. I don’t understand that. The servers are likely in the racks—powered off. You, the Oracle customer, can’t do that because you aren’t Oracle Product Development IT. You’ll be looking at capex—or a custom Exadata configuration—if you need 16 hosts fed by 126 cells.

Parting Thoughts
It is not difficult to configure a Real Application Clusters system capable of beating 16 2-socket Westmere EP servers, with their 176,000 IOPS demand, with far, far less than 9 racks of hardware. It would be Oracle software just the same—just no Exadata bragging rights. And, once a modern, best-of-breed system is happily steaming along hustling 176,000 IOPS, you could even call it an “Engineered System.” Just a good system handling a moderate workload. There is nothing about this workload that can’t be easily handled with conventional, best-of-breed storage. EMC technology with FAST quickly comes to mind.

Beehive is Oracle’s largest database and it runs on a huge Exadata configuration. Those two facts put together do not make any earth-shattering proof point when you study the numbers.

I don’t get it. Well, actually I do.

By the way, did I mention that 176,000 IOPS is not a heavy IOPS load–especially when only 12% of them are writes?

Little Things Doth Crabby Make – Part XVI. Hey ls(1) And du(1) Are Supposed To Agree.

BLOG UPDATE 08-NOV-2011 : After posting and handling comments I realize the title is drawing readers’ attention to the wrong thing. I am aware that ls(1) and du(1) can and should disagree about file size and space utilization because they are two different things (e.g., sparse files and non-file data associated with the file). The point of this blog was just to show that within microseconds (or less) ls(1) reported my file was zero bytes and du(1) reported it occupied nearly 300GB space in the file system. That is the “oddity”–or is it? The original post follows:

I love XFS.

I’ve been doing some, shall we say, rather unseemly sorts of things to one of my XFS file systems. Things are generally holding up, however, a little something made me crabby today so I think another installment in my Little Things Doth Crabby Make series is in order.

I don’t think I even need to explain why the following is something which hath crabby made:

# uname -r
2.6.18-194.26.1.el5
# df -h .
Filesystem            Size  Used Avail Use% Mounted on
/dev/sdb2             2.0T  1.7T  236G  88% /data1
# ls -l foo
-rw-r--r-- 1 root root 0 Nov  7 10:53 foo
# du -sh foo
287G    foo
# rm -f foo
$ ls -l foo
ls: foo: No such file or directory
# df -h .
Filesystem            Size  Used Avail Use% Mounted on
/dev/sdb2             2.0T  1.7T  236G  88% /data1

I’ll blog more on what’s happening here as soon as I can. In the meantime, I’m crabby.  Not really, that’s just the theme of this blog series.

Flash Is Fast! Provisioning Flash For Oracle Database Redo Logging? EMC F.A.S.T. Is Flash And Fast But Leaves Redo Where It Belongs.

Guy Harrison has been blogging his findings regarding solid state disk testing/performance with Oracle. Guy’s tests and reports are very thorough. This is a good body of work. The links to follow are at the bottom of this post.

Before reading, however, please consider the following thoughts about solid state disk as pertaining to Oracle I/O performance and flash:

  1. Oracle Database Smart Flash Cache (DSFC) is flash storage with libC/libaio physical I/O to “augment” the SGA. When using DSFC you will sustain DBWR writes even if your application only uses the SELECT statement. Few people are aware of this fact. The real SGA buffer pool becomes “L1” cache and DBWR spills clean blocks to the L2 (DSFC), where subsequent logical I/O (cache buffers chain hits) will actually require a physical read from flash. Sessions cannot access buffered data in the DSFC. Blocks have to first be read from flash into the SGA before the session can get along with its business. I have never seen DSFC work effectively. I have, on the other hand, seen a whitepaper showing read-intensive “oltp” serviced by an SGA that is much smaller than available RAM on the system but augmented by flash. However, the paper is really just showing that flash reads are better than reads from an under-configured hard disk farm. I’ll blog about that paper soon.
  2. Database Smart Flash Cache, really? Don’t augment DRAM until you’ve maxed out DRAM. Do I honestly aim to assert that augmenting nanosecond operations with millisecond operations is non-optimal? Yes, that is my assertion.
  3. DBWR spills to DSFC aren’t charged to sessions, but such activity does take system bandwidth.
  4. If you find a real-life workload that benefits from DSFC please let me know.
  5. Redo log physical I/O is well suited to round, brown spinning disks. Just don’t melt the disks with DBWR random writes and then expect the same disks to favor the occasional large sequential writes issued by LGWR. Watch your log file parallel write (LFPW) wait events as a component of log file sync (LFS) and you’ll usually find that LFS is a CPU problem, not an LFPW problem. Read this popular post for more on that matter.
  6. Don’t expect much performance increase, in an Exadata environment, from the new “Exadata Smart Flash Log” (ESFL) feature. It may indeed smooth out LFPW service times, and that is a good thing, but the main benefit Smart Flash Log delivers to Exadata customers is relief from the HDD controller starvation (not to mention the imbalance of processors to spindles) that happens in Exadata when DBWR and LGWR are simultaneously demanding IOPS from the limited number of spindles in Exadata standard configurations. Remember, Exadata’s design center is read bandwidth, not write IOPS. A full rack X2 Exadata configuration can only sustain on the order of 25,000 mirrored random writes per second. The closer one pushes Exadata to that 25K IOPS figure the more trouble LGWR has with log flushes. That is, Exadata Smart Flash Log (ESFL) is really just a way to flow LGWR traffic through different adaptors (PCI Flash). In fact, simply plugging in another LSI HDD controller with a few SATA drives dedicated to redo streaming writes would actually do quite well if not as well as ESFL. That is a topic for a different post.
  7. EMC Fully Automated Storage Tiering (FAST) does not muddle around with redo because redo does really well with spinning disks. Hard disk drives do just fine with large sequential writes.
  8. Oracle Database 11g Release 2 added the ability to specify the block size for a redo log. If you do feel compelled to flush redo to solid state I recommend you crack open the documentation for the BLOCKSIZE syntax and add some 4K blocking-factor logs. I made this point on Guy’s blog comment section. I’m not sure if his tests were 4K block size or if they were flushing redo with 512-byte alignment (which flash really doesn’t favor).

And now references to Guy Harrison’s posts:

http://guyharrison.squarespace.com/blog/2011/10/27/using-flash-disk-for-redo-on-exadata.html

http://guyharrison.squarespace.com/ssdguide/04-evaluating-the-options-for-exploiting-ssd.html

Oracle Database Appliance–Bringing Exadata To The Masses. And, No More Patching!

BLOG UPDATE 26-SEP-2011: A lot of the content in this post conveys my strong feelings about throwing around the word “appliance” in the context of Information Technology. Readers have pointed out (in the comment thread below) that my assessment of Oracle Database Appliance vis a vis “appliance status” is akin to spreading half-truths because I work on a product at EMC called an appliance that some, or many, would not deem an appliance. As you read this post, please bear in mind that I do address readers’ views on that matter in the comment thread. The original post follows:

I just googled ‘Oracle Database Appliance’ +Exadata and got offered 446,000 goodies to click on. There are only two problems with that:

  1. Exadata is not an appliance.
  2. Oracle Database Appliance has no Exadata software in it.

Get Out Of Jail Free Card
In this Computerworld article, Mark Hurd is quoted as saying the Oracle Database Appliance  brings “the benefits of Exadata to entry-level systems.” So, I googled ‘this brings the benefits of Exadata to entry level systems’ and was offered 36,300 nuggets of wisdom to read.

I have only one thing to say about this big news. There is a huge difference between a pre-configured system and an appliance.

I’ve never had to apply a patch to my toaster. The Oracle Database Appliance is not an appliance; it is a pre-configured Real Application Clusters system.

SMB (Small/Medium Business) + Real Application Clusters? Who is handing out the get out of jail free cards? Who briefed Oracle’s Executives on what this thing actually is before they started talking about it? Exadata and Oracle Database Appliance are pre-configured, yes, so perhaps that is what Mark Hurd meant when he said “the benefits of Exadata.” If that is the case, I agree. Pre-configured Oracle software is a *huge* benefit because it is very complex.

Oracle Database 11g Direct NFS + Real Application Clusters + VMware vSphere + Fully Automated Storage Tiering? Yes! Of Course!

This is just a quick blog entry to draw attention to a freshly-released EMC white paper.  I don’t aim to turn my blog into an announcement board for such things, but this one is worth it. When I posted my announcement that I’d left Oracle Server Technologies (Exadata development) to join EMC I should have also made the point of how much interest I have in virtualization.

Virtualization (Done Right) and Oracle Database 11g Direct NFS
My convictions regarding the importance of virtualization in modern data center architecture are quite strong. That’s a significant portion of the motivation behind why I left Oracle to join EMC and one of the reasons I really like this paper.  But that’s not all. The paper also centers on Oracle Direct NFS technology. Regular readers know my strong backing of dNFS as well as my long history with the technology.  New readers can visit the past on that matter by reading my many posts on the topic.

The following is a short list of items covered in the paper:

  1. FAST (Fully Automated Storage Tiering). I cannot say enough about the importance of taking care to cache both reads and writes! Imagine that! Not all caching schemes possess that critical attribute! If you try really hard you can probably think of a few dynamic caching solutions that are really good at caching clean data and offer no benefit for writes.
  2. Oracle Real Application Clusters OLTP performance with the Oracle Database 11g Direct NFS feature with and without FAST (Fully Automated Storage Tiering) on physical servers.
  3. Oracle Real Application Clusters OLTP performance with the Oracle Database 11g Direct NFS feature with and without FAST (Fully Automated Storage Tiering) on VMware vSphere virtual servers.
  4. Loads of configuration how-to’s
  5. Coverage of live migration from physical to virtualized Oracle Real Application Clusters. This is my personal favorite in this paper!

Here is a link to the paper: EMC Performance for Oracle – EMC VNX, Enterprise Flash Drives, FAST Cache, VMware vSphere

I wanted folks to get this paper as soon as possible so they’d have something good to read during their flight to Oracle OpenWorld 2011. If you read it you might come up with some difficult questions to pose to the EMC folks in Booth 901 🙂  Go ahead, give it a go. Tell them I said, “Hi” because this will be the first OOW I’ve missed since 1997.

Application Developers Asking You For Urgent Response To A Database Provisioning Request? Tell Them: “Go Do It Yourself!”

…then calmly close the door and get back to work! They’ll be exceedingly happy!

The rate at which new applications pour forth from corporate IT is astounding. Nimble businesses, new and old, react to bright ideas quickly and doing so often requires a new application.  Sure, the backbone ERP system is critical to the business and without it there would be no need for any other application in the enterprise. This I know. However…

When an application developer is done white-boarding a high-level design to respond to a bright idea in the enterprise it’s off to the DBA Team to get the train rolling for a database to back-end the new application. I’d like to tell the DBA Team what to tell the application developer. Are you ready? The response should be:

Go do it yourself! Leave me alone. I’m busy with the ERP system

You see, the DBA Team can say that and still be a good corporate citizen, because this hypothetical DBA Team works in a 21st century IT shop where Database As A Service is not just something they read about in a blog—such as Steve Bobrowski’s blog Database As A Service, which I’ve been following for several years.

Steve’s blog contains a list of some of the pioneers in this technology space. I’m hoping that my trackback to his blog will entice him to include a joint VMware/EMC product on the list. I’d like to introduce readers of this blog to a very exciting technology that I think goes a long way towards realizing the best of what cloud database infrastructure can offer:

VMware vFabric(tm) Data Director

I encourage readers to view this demo of vFabric Data Director and  read the datasheet because this technology is not just chest-thumping IdeaWare™.  I am convinced this is the technology that will allow those in the DBA community to tell their application developers to “go do it yourself” and make their company benefit from IT even more by doing so.

What Can This Post Possibly Have To Do With Oracle Exadata?
Folks who read this blog know I can’t resist injecting trivial pursuit.

The architect and lead developer of vFabric Data Director technology is one of the three concept inventors of Oracle Exadata or, as it was soon to be called within Oracle, Storage Appliance for Grid Environments (SAGE). One of the others of that “team of three” was a crazy-bright engineer with whom I spent time scrutinizing the effect of NUMA on spinlocks (latches) in Oracle Database in the Oracle8i time frame.

It is a small world and, don’t forget, if a gifted application developer approaches your desk for a timely, urgent request for database provisioning just tell him/her to go do it yourself! They’ll be glad you did!

File Systems For A Database? Choose One That Couples Direct I/O and Concurrent I/O. What’s This Have To Do With NFS? Harken Back 5.2 Years To Find Out.

It was only 1,747 days ago that I posted one of the final blog entries in a long series of posts regarding multi-headed scalable NAS suitability for Oracle Database (see index of NAS-related posts).  The post,  entitled ASM is “not really an optional extra” With BIGFILE tablespaces, aimed to question the assertion that one must use ASM for bigfile tablespaces. At the time there were writings on the web that suggested a black and white state of affairs regarding what type of storage can handle concurrent write operations. The assertion was that ASM supported concurrent writes and all file systems imposed the POSIX write-ordering semantics and therefore they’d be bunk for bigfile tablespace support. In so many words I stated that any file system that matters for Oracle supports concurrent I/O when Oracle uses direct I/O. A long comment thread ensued and instead of rehashing points I made in the long series of prior posts on the matter, I decided to make a fresh entry a few weeks later entitled Yes Direct I/O Means Concurrent Writes. That’s all still over 5 years ago.

Please don’t worry: I’m not blogging about 151,000,000 seconds-old blog posts. I’m revisiting this topic because a reader posted a fresh comment on the 41,944 hour-old post to point out that Ext derivatives implement write-ordering locks even with O_DIRECT opens. I followed up with:

I’m thinking of my friend Dave Chinner when I say this, “Don’t use file systems that suck!”

I’ll just reiterate what I’ve been saying all along. The file systems I have experience with mate direct I/O with concurrent I/O. Of course, I “have experience” with ext3 but have always discounted ext variants for many reasons most importantly the fact that I spent 2001 through 2007 with clustered Linux…entirely. So there was no ext on my plate nor in my cross-hairs.

I then recommended to the reader that he try his tests with NFS to see that the underlying file system (in the NFS server) really doesn’t matter in this regard, because NFS supports direct I/O with concurrent writes. I got no response to that recommendation, so I set up a quick proof and thought I’d post the information here. If I haven’t lost you yet for resurrecting a 249-week old topic, please read on:

File Systems That Matter
I mentioned Dave Chinner because he is the kernel maintainer for XFS. XFS matters, NFS matters and honestly, most file systems that are smart enough to drop write-ordering when supporting direct I/O matter.

To help readers see my point I set up a test wherein:

  1. I used a simple script to measure single-file write scalability from one to two writers with Ext3.
  2. I then exported that Ext3 file system via loopback and accessed the files via an NFS mount to ascertain single-file write scalability from one to two writers.
  3. I then performed the same test as in step 1 with XFS.
  4. I then exported the XFS file system and mounted it via NFS to repeat the same test as in step 2.

Instead of a full-featured benchmark kit (e.g., fio, sysbench, iometer, bonnie, ORION) I used a simple script because a simple script will do. I’ll post links to the scripts at the end of this post.
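
The scripts are linked at the end of the post, but for readers who want the gist now, here is a rough sketch of what a test.sh of this shape looks like. This is my reconstruction for illustration, not the exact script: the 4KB I/O size and the per-writer file offsets are assumptions on my part.

```shell
#!/bin/sh
# Sketch of a single-file multi-writer direct I/O test. Each of N
# writers issues 4KB direct writes (dd oflag=direct) into its own
# region of one shared, pre-allocated file; conv=notrunc keeps dd from
# truncating the file. Prints elapsed wall-clock seconds.
run_writers () {
  writers=$1 file=$2 count=$3          # count: 4KB I/Os per writer
  start=$(date +%s)
  i=0
  while [ "$i" -lt "$writers" ]; do
    dd if=/dev/zero of="$file" bs=4k count="$count" \
       seek=$((i * count)) oflag=direct conv=notrunc 2>/dev/null &
    i=$((i + 1))
  done
  wait                                 # block until all writers finish
  end=$(date +%s)
  echo $((end - start))
}
```

With count=524288 (2GB of 4KB writes per writer) against the 4GB bigfile, `run_writers 1 bigfile 524288` and `run_writers 2 bigfile 524288` would mirror the one- and two-writer runs in the test cases that follow.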

Test Case 1

The following shows a freshly created Ext3 file system, creation of a single 4GB file and flushing of the page cache. I then execute the test.sh script, first with a single process (dd with oflag=direct and conv=notrunc) and then with two. The result is no scalability.

# mkfs.ext3 /dev/sdd1
mke2fs 1.39 (29-May-2006)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
13123584 inodes, 26216064 blocks
1310803 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
801 block groups
32768 blocks per group, 32768 fragments per group
16384 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done
This filesystem will be automatically checked every 31 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
# mount /dev/sdd1 /disk
# cd /disk
# tar zxf /tmp/TEST_KIT.tar.gz
# sh -x ./setup.sh
+ dd if=/dev/zero of=bigfile bs=1024K count=4096 oflag=direct
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 9.64347 seconds, 445 MB/s
#
#
# sync;sync;sync;echo 3 > /proc/sys/vm/drop_caches
#
# sh ./test.sh 1
24
# sh ./tally.sh 24
TotIO: 524288 Tm: 24 IOPS: 21845.3
# sh ./test.sh 2
49
# sh ./tally.sh 49
TotIO: 1048576 Tm: 49 IOPS: 21399.5
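
From the transcript, tally.sh evidently divides the total I/O count by the elapsed seconds reported by test.sh. Here is a minimal stand-in that reproduces the output format shown above; passing the per-writer counts as arguments is my invention, and the 524288 per-writer figure is taken from the TotIO lines.

```shell
#!/bin/sh
# Stand-in for tally.sh: first argument is elapsed seconds from
# test.sh, remaining arguments are per-writer I/O counts; prints total
# I/O, elapsed time, and IOPS in the format shown in the transcripts.
tally () {
  tm=$1; shift
  tot=0
  for n in "$@"; do            # sum the per-writer 4KB I/O counts
    tot=$((tot + n))
  done
  awk -v t="$tot" -v s="$tm" \
    'BEGIN { printf "TotIO: %d Tm: %d IOPS: %.1f\n", t, s, t / s }'
}

tally 24 524288          # -> TotIO: 524288 Tm: 24 IOPS: 21845.3
tally 49 524288 524288   # -> TotIO: 1048576 Tm: 49 IOPS: 21399.5
```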

Ext is a file system I truly do not care about. So what if I run the workload accessing the downwind files via NFS?

Test Case 2

The following shows that I set up to serve the ext3 file system via NFS, mounted it loopback-local and re-ran the test. The baseline suffered a 35% decline in IOPS because (a) ext3 isn’t exactly a good embedded file system for a filer and (b) I didn’t tune anything. However, the model shows 75% scalability. That’s more than zero scalability.

#  service nfs start
Starting NFS services:                                     [  OK  ]
Starting NFS quotas:                                       [  OK  ]
Starting NFS daemon:                                       [  OK  ]
Starting NFS mountd:                                       [  OK  ]
# mount -t nfs -o rw,bg,hard,nointr,tcp,vers=3,timeo=300,rsize=32768,wsize=32768 localhost:/disk /mnt
# cd /mnt
# rm bigfile
# sh -x ./setup.sh
+ dd if=/dev/zero of=bigfile bs=1024K count=4096 oflag=direct
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 9.83931 seconds, 437 MB/s
# sync;sync;sync;echo 3 > /proc/sys/vm/drop_caches
# pwd
/mnt
# sh ./test.sh 1
37
# sh ./tally.sh 37
TotIO: 524288 Tm: 37 IOPS: 14169.9
# sh ./test.sh 2
49
# sh ./tally.sh 49
TotIO: 1048576 Tm: 49 IOPS: 21399.5

Test Case 3

Next I moved on to test the non-NFS case with XFS. The baseline showed parity with the single-writer Ext3 case, but the two-writer case showed a 40% improvement in IOPS. Going from one to two writers exhibited 70% scalability. Don’t hold that against me, though; it was a small setup with six disks in RAID5, and it was maxed out. Nonetheless, any scalability is certainly more than no scalability, so the test proved my point.

# umount /mnt
# service nfs stop
Shutting down NFS mountd:                                  [  OK  ]
Shutting down NFS daemon:                                  [  OK  ]
Shutting down NFS quotas:                                  [  OK  ]
Shutting down NFS services:                                [  OK  ]
# umount /disk

# mkfs.xfs /dev/sdd1
mkfs.xfs: /dev/sdd1 appears to contain an existing filesystem (ext3).
mkfs.xfs: Use the -f option to force overwrite.
# mkfs.xfs /dev/sdd1 -f
meta-data=/dev/sdd1              isize=256    agcount=16, agsize=1638504 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=26216064, imaxpct=25
         =                       sunit=0      swidth=0 blks, unwritten=1
naming   =version 2              bsize=4096  
log      =internal log           bsize=4096   blocks=12800, version=1
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0

# mount /dev/sdd1 /disk
# cd /disk
# tar zxf /tmp/TEST_KIT.tar.gz
# sh -x ./setup.sh
+ dd if=/dev/zero of=bigfile bs=1024K count=4096 oflag=direct
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 4.83153 seconds, 889 MB/s
# sync;sync;sync;echo 3 > /proc/sys/vm/drop_caches
# sh ./test.sh 1
24
# sh ./tally.sh 24
TotIO: 524288 Tm: 24 IOPS: 21845.3
# sh ./test.sh 2
35
# sh ./tally.sh 35
TotIO: 1048576 Tm: 35 IOPS: 29959.3

Test Case 4

I then served up the XFS file system via NFS. The baseline (single writer) showed a 16% improvement over the NFS-exported ext3 case. Scalability was 80%. Sandbag the baseline, improve the scalability! 🙂 Joking aside, this proves the point about direct/concurrent I/O on NFS as well.

# cd /
# service nfs start
Starting NFS services:                                     [  OK  ]
Starting NFS quotas:                                       [  OK  ]
Starting NFS daemon:                                       [  OK  ]
Starting NFS mountd:                                       [  OK  ]
# mount -t nfs -o rw,bg,hard,nointr,tcp,vers=3,timeo=300,rsize=32768,wsize=32768 localhost:/disk /mnt
# cd /mnt
# rm bigfile
# sh -x ./setup.sh
+ dd if=/dev/zero of=bigfile bs=1024K count=4096 oflag=direct
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 6.95507 seconds, 618 MB/s
# sync;sync;sync;echo 3 > /proc/sys/vm/drop_caches
# sh ./test.sh 1
32
# sh ./tally.sh 32
TotIO: 524288 Tm: 32 IOPS: 16384.0
# sh ./test.sh 2
40
# sh ./tally.sh 40
TotIO: 1048576 Tm: 40 IOPS: 26214.4
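
Since each writer performs the same fixed amount of I/O, the scalability figures I quote reduce to a simple ratio of the test.sh timings: single-writer elapsed seconds divided by two-writer elapsed seconds. A quick check against all four test cases:

```shell
#!/bin/sh
# Scalability = t(1 writer) / t(2 writers), because two writers perform
# exactly twice the I/O of one. 100% would mean two writers finished
# double the work in the same elapsed time.
scalability () {
  awk -v t1="$1" -v t2="$2" 'BEGIN { printf "%.1f%%\n", 100 * t1 / t2 }'
}

scalability 24 49   # ext3 local:    49.0% (aggregate IOPS flat -> no gain)
scalability 37 49   # ext3 over NFS: 75.5%
scalability 24 35   # XFS local:     68.6%
scalability 32 40   # XFS over NFS:  80.0%
```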

Scripts and example script output:

test.sh
tally.sh
example of test.sh output (handle with tally.sh)

The Moral Of This Blog Entry Is?
…multifarious:

  • Don’t leave comments open on threads for 5.2 years
  • Use file systems suited to the task at hand
  • Kevin is (and has always been) a huge proponent of the NFS storage provisioning model for Oracle
  • ASM is not required for scalable writes

I Do Believe In One-Size-Fits-All Solutions! Humor?

My recent posts about certain technical marketing motions regarding certain information technology have kept me awake at night. However, my readers span all time zones, so I need to remain vigilant in handling email and blog comments. I need a one-size-fits-all solution. I think I’ll pop a couple of these:

Oracle Apps? VMware? Interesting? Throw It Out The Window.

When I announced I was giving up my Architect post in Oracle Server Technologies, I specifically called out some of the technology I was interested in pursuing. One of those areas is Oracle software on VMware. For those of you who share that penchant, you might take interest in Technology Defenestration. Interesting name for an Apps DBA blog! I’ve read quite a bit of J’s writings and have learned some interesting things.

I like blogs written by people “in the trenches.”

Perhaps you will too.

I Can See Clearly Now. Exadata Is Better Than EMC Storage! I Have Seen The Slides! Part I.

Oracle acquired Pillar Data Systems on June 29, 2011, and I was quite silent on the matter because the topic was not blog-worthy. But I’m not blogging about the blog-worthiness of old news. The day after the news regarding the Pillar acquisition, Oracle executives held a webinar covering the latest turn in Oracle’s storage strategy. Some of the content in that webinar was, in fact, worthy of a blog entry.

The webinar slides can be found here: Oracle Storage Strategy Update June 29, 2011 (slides).

Fair or Foul?
Imagine for a moment that SAP spends the money to compete in the America’s Cup to race the Oracle boat. Imagine further that the captain of the boat fouled in some way during a race (I know nothing about sailboat racing because I’m a mere mortal). It would take Oracle executives about 42 seconds to call out the issue. Well, as I see it, slide number 15 in the above-referenced Storage Strategy webcast slide deck is a foul, and while it took me longer than 42 seconds to get to it, I’d like to address the matter now.

As the following screen shot of slide 15 shows, Oracle is specifically targeting EMC storage technology to contrast with Exadata. Forget for a moment that the design center for Exadata was DW/BI, so the logical comparison would be to Greenplum. That’s saved for Part II. The problem with calling out EMC specifically is that in doing so one tends to attract the attention of guys like me. You know, guys who know Exadata really, really well. I’m going to strive to choose my words carefully at this point, as I’m sure Oracle did when they scooped all those words into slide 15.

In my assessment, Oracle used misleading information while poking the Slide 15 Stick™ at EMC. But not misleading in the manner you might expect. The misleading information isn’t what Oracle stated about the EMC product in the comparison. Instead, the misleading information is actually under the Exadata column! So, I’ll call out these errata in the following list and then show the screenshot after that.

  1. Database IOs per second. This bullet grants a dubious 150,000 to “EMC Symmetrix” and clearly sets the stage for an OLTP-slanted comparison by citing the 1.5 million read IOPS (datasheet) capability for a single full-rack Exadata in the Exadata column. Don’t get me wrong, I don’t question the 1.5 million read IOPS capability for 8KB random I/O, because I held the position of Performance Architect in Oracle’s Exadata development organization for years and I’ve used my own test kits to get that many IOPS on my own Exadata lab gear. But stick with me for a moment. It’s really, really faulty marketing to pick random capabilities cited in the datasheet and plop them in a slide for your executives to present. The next row in the Exadata column cites a compression figure, and a Hybrid Columnar Compression (HCC) figure at that. Folks, Exadata cannot achieve the datasheet read IOPS figure (1.5 million) with compressed data, most particularly not HCC data. By citing the 1.5 million read IOPS, Oracle sets the stage that this is an OLTP/ERP-slanted comparison to EMC. After all, IOPS is not a DW/BI issue. So I’m scrutinizing the rest of the list from that position.
  2. Average Database Compression Factor. This is a new move for Oracle. I’ve not seen claims of “average database compression” before. Generally, you’ll see wording such as “up to 10x compression” or “10x to 15x compression” when referring to the full HCC suite, which offers both query and archive compression. But, no matter. What you shouldn’t like about the claim of “average database compression” is the fact that Exadata is marketed as an OLTP machine yet HCC cannot be used for OLTP. So if you’ve bought Exadata for OLTP you’re probably wondering where your “average” 10x savings is. On the other hand, if you are using Exadata for OLTP, and wisely employing the only compression technology suitable for OLTP (Oracle Advanced Compression, a.k.a. ACO), you are enjoying the same 2-4x compression you would have if your database was stored on any storage, even EMC! Imagine that. Oracle and EMC share tens of thousands of Oracle customers running OLTP or ERP and enjoying the same compression ratios across the board. So, stating zero for EMC in the compression class is, well, you know…
  3. Database bandwidth, high performance disks. This line in the table is bizarre. I know the Exadata bandwidth datasheet numbers by heart (that shouldn’t come as a surprise). The full-rack X2 models with High Performance drives sustain 25 GB/s when scanning only the HDD disks. When scanning both HDD and flash cache concurrently, the bandwidth number jumps to 75 GB/s, again for a full rack with High Performance drives. I can only surmise that the person who filled in this slide simply multiplied the HDD-only scan rate by the “average” compression factor (10x) to get 250 GB/s. That means the row should be called “effective bandwidth.” But even that isn’t the real problem I have with this point. The row is labeled “database bandwidth.” OLTP databases are databases, and I assure you there is no 10x effective increase for OLTP/ERP database I/O bandwidth with Exadata because (and I can’t repeat this enough) there is no HCC for OLTP/ERP use cases. Perhaps the row should be called “HDD Effective Scan Rate,” which has little to nothing to do with OLTP, so I’m looping back to the 1.5 million read IOPS citation again.
  4. Database cache usable capacity. This one is really broken. First off, we all know that the “database cache” is first and foremost the Oracle System Global Area, which all Oracle databases have regardless of the storage. That is the database cache. If the slide was meant to allude to “storage cache” then fine, let’s take it that way. The slide cites 53 TB for Exadata. Those are lovely characters and numbers but they don’t mean anything. The Exadata Storage Server (X2) has 384 GB of Exadata Smart Flash Cache in each of the 14 storage servers in a full-rack configuration. That’s 5,376 GB. I think, here again, the cited number is a mash-job of physical capacity and the DW/BI compression ratio commonly achieved with HCC (10x). The title of the slide is “Better than EMC for Database I/O.” OLTP databases are indeed databases, and OLTP is not compatible with HCC compression. So, the information is…a foul.
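
To make points 3 and 4 concrete, here is the arithmetic I surmise sits behind the slide’s 250 GB/s and 53 TB figures. The derivation is my guess, not Oracle’s stated method; the input figures are from the datasheet numbers cited above.

```shell
#!/bin/sh
# Reconstructing slide 15's Exadata numbers from datasheet figures.
FLASH_PER_CELL_GB=384   # Smart Flash Cache per X2 storage server
CELLS=14                # storage servers in a full rack
HCC_FACTOR=10           # the "average" compression the slide assumes
HDD_SCAN_GBS=25         # full-rack HDD-only scan rate, High Performance drives

physical_flash=$((FLASH_PER_CELL_GB * CELLS))
echo "Physical flash cache: ${physical_flash} GB"                  # 5376 GB
echo "x10 'average' HCC:    $((physical_flash * HCC_FACTOR)) GB"   # ~53 TB, the slide's figure
echo "Effective scan rate:  $((HDD_SCAN_GBS * HCC_FACTOR)) GB/s"   # the slide's 250 GB/s
```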

So, as in the case of my America’s Cup racing foul example, I cry foul. It took me more than 42 seconds to do so though.

Here is Slide 15:

Follow this link to Part II in the series.

Quit Blogging or Give a Quick Update With Pointers To A Good Blog, A Glimpse Of The Future And A Photo Of Something From The Past? Yes.

I haven’t quit blogging; it’s just that I haven’t had a spare moment since joining the EMC Data Computing Division back in March. My involvement in the upcoming Apress book about Exadata, entitled Expert Oracle Exadata, is complete, so that’s one fewer side-bar effort holding me back from blogging. I should be able to craft some interesting content soon. I have a huge backlog of material I need to cover.

Now that I’m mentioning that Apress book I realize I still haven’t added Kerry Osborne to my blog roll. That’s a serious oversight. Folks, don’t miss Kerry’s blog!

What Is An Asymmetrical MPP? I Know What I Mean When I Say That. Do You?
After this blogging hiatus is over, I plan to start a blog series covering an interesting architectural characteristic of data warehouse solutions. No, I’m not going to be the billionth person to regurgitate the phrase “shared-nothing MPP,” because it simply doesn’t matter. That’s an argument for academics. Readers of this blog have commercial needs for business solutions—and IT budgets. The topic I’ll be blogging about is MPP symmetry or, more accurately, the lack thereof. So I need terminology. In fact, it would be nice to coin a term. Unfortunately for me, however, Netezza laid claim to the term Asymmetrical MPP, or ASMPP for short. The context in which they use the term is not pejorative.

After I blog about MPP asymmetry, or what I refer to as Asymmetrical MPP, the term will be clearly pejorative—but the reasons will have nothing to do with Netezza. I have no intention of mentioning that particular technology at all going forward.

I think I’ll close this little update with a photo of one of the fish I caught during one of my last weekend outings before joining EMC. A photo of a fish is off-topic so I’ll put it under that page and offer a quick link here:

It’s just a fish.

EMC World 2011? What A Waste Of Time!

EMC World 2011

Waste of time? Absolutely not! It won’t be a waste of time…certainly not for me. I am going and looking forward to it. I’m motivated to learn more about Cloud computing infrastructure and am super interested in seeing what’s going on with VMware. The way I look at virtualization is two-fold: (a) it solves a lot of problems and (b) pretty much everything will eventually be running virtualized, so I want to be ahead of the curve. There’s no need to be in denial.

Along those lines, I invite folks to read the following paper:

Oracle E-Business Suite on EMC Unified Storage with VMware Virtualization

I believe there is a session on this at EMC World. If so, I plan to attend. One of the things that impresses me most about the project behind this paper is the proof positive that EMC IT actually “eats their own dog food,” as the cliché goes. I’m tired of the “Do as I say, not as I do” crowd.

Perhaps some of you are going to EMC World as well? If so, it’ll be a pleasure to meet up.

16GFC Fibre Channel is 16-Fold Better Than 4GFC? Well, All Things Considered, Yes. Part I.

16GFC == 16X
If someone walked up to you on the street and said, “Hey, guess what, 16GFC is twice as fast as 8GFC! It’s even 8-fold faster than what we commonly used in 2007,” you’d yawn and walk away. In complex (e.g., database) systems there’s more to it than line rate. Much more.

EMC’s press release about 16GFC support effectively means a 16-fold improvement over 2007 technology. Allow me to explain (with the tiniest amount of hand-waving).

When I joined the Oracle Exadata development organization in 2007 to focus on performance architecture, the state of the art in enterprise storage connectivity was 4GFC Fibre Channel. However, all too many data centers of that era were slogging along with 2GFC connectivity (more on that in Part II). With HBAs plugged into FSB-challenged systems via PCIe, it was uncommon to see a commodity Linux system configured to handle more than about 400 MB/s (e.g., 2 x 2GFC or a single active 4GFC path). I know more was possible, but for database servers that was pretty close to the norm.

We no longer have front-side bus systems holding us back*. Now we have QPI-based systems with large, fast main memory, PCIe 2.0 and lots of slots.

Today I’m happy to see that 16GFC is quickly becoming a reality, and I think balance will be easy to achieve with modern ingredients (i.e., switches, HBAs). Even 2U systems can handily process data flowing via several paths of 16GFC (1600 MB/s each). In fact, I see no reason to shy away from plumbing 4 paths of 16GFC to two-socket Xeon systems for low-end DW/BI. That’s 6400 MB/s…and that is 16X better than where we were even as recently as 2007.
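
The arithmetic behind the 16X headline, spelled out. The rates are nominal payload figures of the kind discussed above, not measurements:

```shell
#!/bin/sh
# 16X = four modern 16GFC paths versus a typical 2007 configuration.
PATH_MBS_16GFC=1600     # usable payload of one 16GFC path
PATHS=4
MBS_2007=400            # e.g., 2 x 2GFC or a single active 4GFC path

total=$((PATH_MBS_16GFC * PATHS))
echo "Four 16GFC paths: ${total} MB/s"         # 6400 MB/s
echo "Versus 2007: $((total / MBS_2007))X"     # 16X
```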

Be that as it may, I’m still an Ethernet sort of guy. I’m also still an NFS sort of guy, but no offense to Manly Men intended.

A Fresh Perspective
The following are words I’ve waited several years to put into my blog, “Let’s let customers choose.” There, that felt good.

In closing, I’ll remind folks that regardless of how your disks connect to your system, you need to know this:

Hard Drives Are Arcane Technology. So Why Can’t I Realize Their Full Bandwidth Potential?

* I do, of course, know that AMD Opteron-based servers were never bottlenecked by Front Side Bus. I’m trying to make a short blog entry. You can easily google “kevin +opteron” to see related content.


DISCLAIMER

I work for Amazon Web Services. The opinions I share in this blog are my own. I'm *not* communicating as a spokesperson for Amazon. In other words, I work at Amazon, but this is my own opinion.

Copyright

All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.