Archive for the 'oracle' Category



EU Clears Oracle to Buy Sun Microsystems

EU Clears Path for Oracle Sun Acquisition (Wall Street Journal).

Little Things Doth Crabby Make – Part XI. Gigabyte or Gibibyte? Apparently Both!

It’s been a while since I’ve posted an installment in my Little Things Doth Crabby Make series. While I’m not feeling particularly crabby, I have to admit that this one just got under my skin today.

Giga, Gibi, Giggly
I’m like most of you when it comes to computer capacity nomenclature. In spite of standards bodies, I just can’t bring myself to think of a gigabyte as 10^9 bytes. The guys that sell memory still sell 2^30 bytes, and processor caches are still sized in multiples of 2^20 bytes (a.k.a. old-school megabytes). But, according to standards bodies, a gigabyte is 1000000000 bytes, and what I refer to as a gigabyte is actually a gibibyte, or 1073741824 bytes.

Network technology has always used decimal nomenclature (e.g., megabit, gigabit), and since about the late 1990s hard drive manufacturers have used decimal as well. So, when we install something like a 600GB (gigabyte) drive and partition it, we usually see 558 gibibytes, such as this drive in the Sun Oracle Exadata Storage Server:

# cat /proc/partitions | grep sdh
   8   112  585531392 sdh
# bc -l
bc 1.06
Copyright 1991-1994, 1997, 1998, 2000 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'.
(585531392 * 1024) / (2^30)
558.40625000000000000000

Help Me Please, He’s Boring Me To Death!
I know, and I apologize. But let me just take one more moment of your time to show you what little thing hath crabby made. It’s all about the content in the following box. I tell dd(1) to write to disk exactly 1GB (a.k.a. a “Kevin gigabyte”). That is, I want dd(1) to use a single call to libc write to fill out 2^30 bytes. That works, you know, as long as you have the address space and physical memory. Anyway, back to the topic. So, I set dd(1) out to create a 2^30-byte, er, I mean 1GB file, and so it did. But it reported its success to me as if it were a real strict IEEE or networking sort of guy. In the box you’ll see that it reported it wrote 1.1GB, and it did: 1.1 gigabytes.

Sorry, I’m still boring you to death.

So the thing that made me crabby was the fact that I followed up the dd(1) command with an ls(1) command using the Linux -h option which, according to the manpage, reports sizes in “human readable” form. That’s OK with me since, to me, K, M and G are all powers of two and I’m a human. But I thought we were supposed to start rethinking our way to the decimal nomenclature. Hmmm. See for yourself:

# dd if=/dev/zero of=GB bs=1073741824 count=1
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 2.4089 seconds, 446 MB/s
# ls -lh GB
-rw-r--r-- 1 root root 1.0G Jan  6 14:48 GB

Nah, that didn’t really make me that crabby. I do get tired of the decimal thing though. But worse than that is the schizophrenia dd(1) exhibits. The following is what really made me crabby:

$ dd if=/dev/zero of=1G bs=1G count=1
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 1.19684 seconds, 897 MB/s

Now that is a good one! It’s sadly schizophrenic but entertaining.

Hot Oracle-Related Search Engine Terms For 2009? Carry On Wayward Googler!

2009 is drawing to a close, so I just took a gander at the year-end search term statistics for my blog. It’s interesting to see the search terms that send readers to this blog. Some are surprising, others not so surprising. So, for trivial pursuit’s sake, the following list shows the top 50 search terms, which resulted in a little over 30,000 click-throughs in CY2009. Quasi-interesting.

Search Term Visits
kevin closson 5,612
site:kevinclosson.wordpress.com direct i 2,332
cpuid linux 2,258
svr4 error: 25: inappropriate ioctl for 1,693
oracle exadata 1,460
linux cpuid 1,279
oracle orion 1,276
exadata 1,221
oracle nfs 1,054
oracle 10.2.0.4 970
kevin closson blog 872
gridsql 819
oracle rhel5 704
oracle alert log 588
ges: potential blocker 564
cpuid for linux 515
rhel5 oracle 510
prot-1: failed to initialize ocrconfig 455
sun flashfire 440
hugepages 393
oracle disk manager 393
closson wordpress 382
site:kevinclosson.wordpress.com direct i 377
linux cpu id 371
oracle nfs mount options 319
fibre channel 311
sql server on linux 305
11g alert log 304
cpu id linux 298
nfs performance 295
os watcher 285
oracle secure files 278
sched_setaffinity 278
clusterdeconfig 277
t2000 oracle 275
orion oracle 272
selinux oracle 267
oracle numa 262
oracle blog 262
closson 261
crsd.bin 256
manly men 250
alert log location 242
10.2.0.4 241
solaris cp 237
shared appl_top 217
oracle t2000 217
dbwr 216
oracle on nfs 215
oracle database 10.2.0.4 214

Oracle Database File System (DBFS). It’s Not An “Exadata Feature.” – Part I.

Last summer I posted a blog entry about the Oracle Database 11g feature known as Database File System (DBFS). That entry discussed how suitable DBFS is as a staging file system for ELT/ETL operations. Since then I’ve also offered a webcast covering some DBFS-related topics. With all that, one would naturally presume DBFS is an Exadata-only feature. It is not. I do, however, get frequent requests for Exadata-specific tuning tips for DBFS. There are none.

Deploying DBFS in an Exadata environment is exactly the same as with non-Exadata storage–just faster.

I do feel that DBFS is much more than just a curiosity. Indeed, a file system stored in a database? Really? Yes! What a twist, though, given that it has always been the other way around!

I aim to make a few blog entries soon to offer some compelling reasons why Oracle shops might do well to look into the future for areas in which DBFS can help solve problems. Indeed, others in the blogosphere have caught on to the fact that DBFS is not an Exadata-only feature as demonstrated by Ronny Egner’s excellent blog post on DBFS.

I’d also like to draw attention to this excellent PowerPoint presentation on DBFS.

So, I’m going to conclude with this as Part I in a series where I’ll cover some setup, diagnostic and performance information about DBFS.

More Words About Oracle Direct NFS On Sun Storage 7410 And Chip Multithreading Technology (CMT) For Oracle Database 11g Release 2.

In my recent blog entry entitled Words About Oracle Direct NFS On Sun Storage 7410 And Chip Multithreading Technology (CMT) For Oracle Database 11g Release 2 I discussed how my old friend Glenn Fawcett and I are studying OLTP performance characteristics of dual-socket Xeon 5500 (Nehalem EP) versus the Sun T5240 (also dual-socket). After I set Glenn up with a nice flexible OLTP workload, Glenn started collecting results and analyzing Oracle Direct NFS (DNFS) performance. Now that he has finished testing with DNFS over 10GbE, he posted a fresh blog entry on the matter that can be found at the following link:

The previous installments on that thread can be found here:

Glenn measured some 90,000 random 8KB physical IOPS over 10GbE and over 1GB/s when scanning disk!

As I’ve been saying for quite some time, “NFS, it ain’t just for kids no more!”

Now, just in case anyone thinks we are wasting our time with some sort of “trick” workload, take my word for it: this workload is grueling. Although I haven’t seen the AWR report from the 90,000+ IOPS 10GbE run, I know enough about the T5240 and this workload to guesstimate that the logical I/O rate for SGA buffer hits would have been on the order of 250,000 per second and the CPUs were likely about 60% utilized. I know Glenn reads my blog and he’ll see the trackback from this post, so let’s see if he can chime in here and correct me on that LIO/CPU guesstimate… Glenn?

Pardon Me, Where Is That Flash Cache? Part II.

In Part I of this series about Database Smart Flash Cache I laid a bit of groundwork on the topic of Database Flash Cache and how it differs from Exadata Smart Flash Cache. I’d like to continue this series by discussing some of the mechanisms associated with Database Smart Flash Cache. But first, I’d like to offer a little more background in case the architecture of Database Smart Flash Cache is still confusing.

Fast Forward To The Past
During the mid-to-late 1990s, and the earlier part of this decade, DBAs would routinely find themselves managing 32-bit Oracle databases on systems with very large physical memory and a 64-bit operating system. These systems would support a system buffer cache limited only by physical memory. In order to help reduce physical I/O, administrators would sometimes use buffered file system files for Oracle. Oracle I/Os were therefore DMAed into the system buffer cache and then copied in kernel mode into the address space of the calling Oracle process. Re-reads from the system buffer cache eliminated physical I/Os. This model came with significant penalties, such as the fact that all data in the SGA buffer cache was also present in the system buffer cache (a waste of memory). The memory copy operations from the system buffer cache into the SGA buffers (for every physical I/O) were quite a load on systems of the time period as well.

While working in Sequent Database Engineering in the 1990s I was one of the folks pushing for a model quite similar to how Database Flash Cache works to address Oracle caching needs. Instead of files on Flash, large “buffer cache files” were to reside in main memory. The proposal was such that when a buffer aged from the SGA buffer pool it would be written to the “buffer cache file.” Disk reads would occur directly into the SGA. If data happened to be in the “buffer cache file” it would be copied from RAM (with a mapping operation), thus eliminating a physical I/O. That would have been much better than the double-buffering effect of using buffered file system files. As it turns out, it would have also been quite similar to the Database Smart Flash Cache architecture.

None of that ever came to pass, however, because we devised and implemented Indirect Data Buffers as a much better approach than leveraging system buffer cache or any other such external caching. However, that didn’t keep me from pursuing NUMA-optimized disk caching technology as my NUMA-optimized external cache patent sort of suggests.

What does this have to do with Database Smart Flash Cache?

Is that a Logical Read or a Physical Read? Yes!
The real goal of Database Smart Flash Cache is to reduce physical disk reads; it can’t do anything for writes. Most OLTP/ERP applications do more reading than writing, so I think this is OK. The interesting twist is that physical disk reads from Database Smart Flash Cache occur in the logical I/O code path.

A logical I/O that hits a buffer previously aged to the Flash Cache requires a physical read to bring the buffer back into the address space of the process. Say what? This technology puts a physical read in the logical read code path? Yes.

Think of it this way. If a process doesn’t find the block it needs in the buffer cache, in a non-Database Smart Flash Cache scenario it has to perform a read from disk. In the case of Database Smart Flash Cache, there is a miss in the first-level SGA buffer pool (a.k.a. L1 SGA) followed by a hit in the second-level SGA buffer pool, and at that point there is a physical disk read from Flash. The read from Flash will be quite fast, and in the case of a PCI Flash card there won’t even be interaction with a central disk controller, since each PCI Flash card has its own built-in controller.

In the next installment in this series of blog posts I will cover a performance comparison of Database Flash Cache. However, I need to end this installment with a quick look into how the instance accesses the Database Smart Flash Cache file.

It’s Cache, But It’s A File
In the following box I’ll show the initialization parameters and DBWR open files for an instance that is using a Database Smart Flash Cache file.

SQL> show parameter db_flash

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
db_flash_cache_file                  string      /home/cache/scratchfile
db_flash_cache_size                  big integer 31G

# ps -ef | grep dbw
oracle    6466     1  0 Dec07 ?        00:00:35 ora_dbw0_OLTP1
oracle    6468     1  0 Dec07 ?        00:00:39 ora_dbw1_OLTP1
oracle   10428     1  0 Oct20 ?        00:02:59 asm_dbw0_+ASM1
oracle   11733 26379  0 08:17 pts/12   00:00:00 grep dbw
# ls -l /proc/6466/fd | grep file
lrwx------ 1 oracle oinstall 64 Dec  7 19:47 22 -> /home/cache/scratchfile
lrwx------ 1 oracle oinstall 64 Dec  7 19:47 33 -> /home/cache/scratchfile

OK, so that showed us that the DBWR process for the OLTP1 instance has two file descriptors open on the Database Smart Flash Cache file. After starting an OLTP workload I straced DBWR to see how it was interacting with the Flash Cache file:


strace -p 6466 2>&1 | head -10
Process 6466 attached - interrupt to quit
pwrite(33, "\6\242\0\0\345]\0\0HP\316\252\0\0\2\4\371\376\0\0\1\0\0\0\345E\1\0\226L\316\252"..., 8192, 29339877376) = 8192
pwrite(33, "\6\242\0\0004o\0\0\247l\316\252\0\0\2\4\2404\0\0\2\0\0\0\351E\1\0005l\316\252"..., 8192, 29339811840) = 8192
pwrite(33, "\6\242\0\0\344\27\0\0_M\316\252\0\0\2\4t\262\0\0\1\0\0\0\345E\1\0\226L\316\252"..., 8192, 29339795456) = 8192
pwrite(33, "\6\242\0\0\2202\3\0\221R\316\252\0\0\2\4\330\244\0\0\1\0\0\0\340E\1\0_L\316\252"..., 8192, 29338681344) = 8192
pwrite(33, "\6\242\0\0\310S\0\0\17O\316\252\0\0\2\4\262z\0\0\1\0\0\0\342E\1\0\345K\316\252"..., 8192, 29338583040) = 8192
pwrite(33, "\6\242\0\0[\236\7\0GU\316\252\0\0\2\4.\301\0\0\1\0\0\0\340E\1\0_L\316\252"..., 8192, 29338492928) = 8192
pwrite(33, "\6\242\0\0\371\277\2\0\364]\316\252\0\0\2\4\371\315\0\0\1\0\0\0\342E\1\0\345K\316\252"..., 8192, 29336969216) = 8192
pwrite(33, "\6\242\0\0\271\353\0\0\332e\316\252\0\0\2\4\220\262\0\0\2\0\0\0\350E\1\0\247e\316\252"..., 8192, 29336190976) = 8192
pwrite(33, "\6\242\0\0003F\v\0\227[\316\252\0\0\2\4~*\0\0\1\0\0\0\340E\1\0_L\316\252"..., 8192, 31513272320) = 8192

Well, that’s not good! DBWR is performing synchronous writes to the Flash Cache file. No mystery, I forgot to set filesystemio_options=setall. After doing so, DBWR uses libaio calls to spill into the Flash Cache file (the 5th word in each iocb is the file descriptor). The following snippet shows a blast of 165 writes to the Flash Cache file:


io_submit(47959748861952, 165, {{0x2b9e7fb02630, 0, 1, 0, 33}, {0x2b9e7faa3c70, 0, 1, 0, 33}, {0x2b9e7fa545f8, 0, 1, 0, 33}, {0x2b9e7fb91a70, 0, 1, 0, 33},
{0x2b9e7fa276f0, 0, 1, 0, 33}, {0x2b9e7faf68f8, 0, 1, 0, 33}, {0x2b9e7fafcfb0, 0, 1, 0, 33}, {0x2b9e7fc09f10, 0, 1, 0, 33}, {0x2b9e7fb8b920, 0, 1, 0, 33},
{0x2b9e7fa91880, 0, 1, 0, 33}, {0x2b9e7fb05170, 0, 1, 0, 33}, {0x2b9e7fb80c20, 0, 1, 0, 33}, {0x2b9e7faa6d18, 0, 1, 0, 33}, {0x2b9e7faba140, 0, 1, 0, 33},
[ .. many lines deleted ...]
{0x2b9e7fbb6d20, 0, 1, 0, 33}, {0x2b9e7faa1c00, 0, 1, 0, 33}, {0x2b9e7faeabc0, 0, 1, 0, 33}, {0x2b9e7fa7f490, 0, 1, 0, 33}, {0x2b9e7fb8a380, 0, 1, 0, 33},
{0x2b9e7fad5c90, 0, 1, 0, 33}}) = 165

So we’ve seen the mechanisms DBWR uses to spill into the Flash Cache file. What about foreground processes? The following shows a shadow process performing 8KB synchronous reads from the Flash Cache file. These pread() calls are physical I/Os, yet they occur in the extended code path of a logical I/O. We are talking about cache hits, after all, and a hit is an SGA LIO.

$ ls -l /proc/32719/fd | grep scratch
lrwx------ 1 oracle oinstall 64 Dec 10 09:19 14 -> /home/cache/scratchfile
[oracle@dscgif05 dbs]$ strace -p 32719 2>&1 | grep pread | head -20
pread(14, "\6\242\0\0\256\215\1\0\337e\316\252\0\0\2\4\304\34\0\0\2\0\0\0\350E\1\0\316e\316\252"..., 8192, 23248191488) = 8192
pread(14, "\6\242\0\0\204\31\6\0\366\367\\\253\0\0\1\4z\321\0\0\1\0\0\0\340E\1\0\366\367\\\253"..., 8192, 17962737664) = 8192
pread(14, "\6\242\0\0\202:\1\0\366S\316\252\0\0\2\4\25\301\0\0\1\0\0\0\342E\1\0\345K\316\252"..., 8192, 32084803584) = 8192
pread(14, "\6\242\0\0\204V\5\0\346\272\\\253\0\0\1\4\207\247\0\0\1\0\0\0\340E\1\0\346\272\\\253"..., 8192, 27747041280) = 8192
pread(14, "\6\242\0\0\337\305\0\0\337e\316\252\0\0\2\4ey\0\0\2\0\0\0\350E\1\0ne\316\252"..., 8192, 30495244288) = 8192
pread(14, "\6\242\0\0= \5\0\2035h\253\0\0\1\4)\223\0\0\1\0\0\0\340E\1\0\2035h\253"..., 8192, 20019929088) = 8192
pread(14, "\6\242\0\0\223\v\1\0,T\337\252\0\0\1\6\316\207\0\0\1\0\0\0\342E\1\0+T\337\252"..., 8192, 28805251072) = 8192
pread(14, "\6\242\0\0\214\256\0\0\332e\316\252\0\0\2\4\5a\0\0\2\0\0\0\350E\1\0ne\316\252"..., 8192, 31322808320) = 8192
pread(14, "\6\242\0\0\241w\4\0#\232m\253\0\0\1\4\334A\0\0\1\0\0\0\340E\1\0#\232m\253"..., 8192, 31416205312) = 8192
pread(14, "\6\242\0\0\304<\1\0\374S\316\252\0\0\2\4>\257\0\0\1\0\0\0\342E\1\0\345K\316\252"..., 8192, 24838955008) = 8192
pread(14, "\6\242\0\0`\201\1\0\332e\316\252\0\0\2\4=*\0\0\2\0\0\0\350E\1\0\316e\316\252"..., 8192, 21697290240) = 8192
pread(14, "\6\242\0\0z\300\5\0\341Tl\253\0\0\1\4\365\315\0\0\1\0\0\0\340E\1\0\341Tl\253"..., 8192, 21481119744) = 8192
pread(14, "\6\242\0\0O\243\1\0\365e\316\252\0\0\2\4\261!\0\0\2\0\0\0\350E\1\0Ue\316\252"..., 8192, 22630391808) = 8192
pread(14, "\6\242\0\0\221Z\r\0\240B]\253\0\0\1\4zK\0\0\1\0\0\0\340E\1\0\240B]\253"..., 8192, 32498204672) = 8192
pread(14, "\6\242\0\0\222Z\r\0\t\335g\253\0\0\1\4$\316\0\0\1\0\0\0\340E\1\0\t\335g\253"..., 8192, 25871138816) = 8192
pread(14, "\6\242\0\0o\245\1\0\365e\316\252\0\0\2\4s\272\0\0\2\0\0\0\350E\1\0Ue\316\252"..., 8192, 30909734912) = 8192
pread(14, "\6\242\0\0\373i\r\0\211\10b\253\0\0\1\4.q\0\0\1\0\0\0\30E\1\0\211\10b\253"..., 8192, 27650859008) = 8192
pread(14, "\6\242\0\0\374i\r\0kTc\253\0\0\1\4v0\0\0\1\0\0\0\340E\1\0kTc\253"..., 8192, 18883878912) = 8192
pread(14, "\6\242\0\0t\357\2\0\2678\344\252\0\0\1\6R3\0\0\1\0\0\0\342E\1\0\345K\316\252"..., 8192, 23501406208) = 8192
pread(14, "\6\242\0\0003\247\1\0\370e\316\252\0\0\2\4\314\35\0\0\2\0\0\0\350E\1\0Ue\316\252"..., 8192, 27493490688) = 8192
$ ps -ef | grep 32719 | grep -v grep
oracle   32719 32700  4 09:17 ?        00:00:05 oracleOLTP1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
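As an aside, the /proc spelunking used throughout this post generalizes nicely. Here is a small sketch (fds_on_file is a hypothetical helper name, Linux only) for listing which descriptors a process holds open on a given file:

```shell
# Print the file descriptor numbers a process holds open on a given path,
# by matching the symlink targets under /proc/<pid>/fd (Linux only).
fds_on_file() {
  pid=$1
  path=$2
  ls -l /proc/"$pid"/fd 2>/dev/null | awk -v p="$path" '$NF == p { print $(NF-2) }'
}
```

For the DBWR process shown earlier, fds_on_file 6466 /home/cache/scratchfile would print 22 and 33.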

Summary
This post in the series aimed to show some of the mechanisms involved when processes interact with the Database Smart Flash Cache file. In Part III I’ll be sharing some performance numbers.
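One last aside: the filesystemio_options fix I mentioned above is a static parameter, so it has to be staged in the spfile and followed by an instance bounce. A sketch, assuming a local SYSDBA bequeath connection and an spfile-based instance:

```shell
# filesystemio_options is not dynamic: set it in the spfile, then restart.
# Assumes ORACLE_SID and ORACLE_HOME are set for the target instance.
sqlplus -s "/ as sysdba" <<'EOF'
alter system set filesystemio_options=setall scope=spfile;
shutdown immediate
startup
EOF
```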

Pardon Me, Where Is That Flash Cache? Part I.

Flash is Flash is Flash, Right?

I was recently catching up on my reading of Jonathan Lewis’ blog when I read his post about Flash Cache. Jonathan’s post contained a few comments on the matter and then refers to Guy Harrison’s post on the topic.

After reading both of those posts I am now convinced more than ever that there is significant confusion over Flash Cache where Oracle Database 11g Release 2 is concerned. Both Jonathan and Guy state that Database Flash Cache (a.k.a. Database Smart Flash Cache) is an Exadata-only feature liberated for use in non-Exadata environments. I need to point out that this is not correct. However, when extremely smart folks get it wrong, I think it is safe to say there must not be enough information available.

Distant Cousins

From the initial release of Oracle Database 11g Release 2, the Database Flash Cache feature has had nothing in common with Exadata Smart Flash Cache other than the fact that Flash technology is at the center of both. In fact, it would be quite difficult for them to be any more dissimilar. Put simply:

  • Database Flash Cache is an extension of the SGA that resides in Flash. Only data buffered in the SGA can make its way into the Database Flash Cache. The data is essentially “aged” out and written by DBWR into the Flash Cache device file.
  • Exadata Smart Flash Cache is PCI Flash in Exadata Storage Server used as an intelligent, adaptive cache. That is to say, regardless of whether the primary buffering of data is in the SGA or PGA, Exadata Smart Flash Cache sits down-wind, caching data based upon demand.

The Oracle Database 11g New Features list introduces Database Flash Cache as follows:

Database Smart Flash Cache

Database Smart Flash Cache is an optional memory component that you can add if your database is running on Solaris or Oracle Enterprise Linux. It is an extension of the SGA-resident buffer cache, providing a level 2 cache for database blocks. It can improve response time and overall throughput.

As for Exadata Smart Flash Cache, there is a good white paper on the topic available here.

Part II can be found here.

Out of Memory Errors With Oracle Database On RHEL 4.8 Or OEL 4.8? Freeing Memory Perhaps?

In my recent post about free versus reclaimable memory on Linux (Linux Free Memory, Is It Free Or Reclaimable?) I discussed a technique to quickly shuffle all pages containing clean, buffered file system data to the free list. If you are interested, please see that post for an explanation of why I would mess with free versus reclaimable pages. The following is a quick reminder of what /proc/sys/vm/drop_caches is used for:

To clear clean pages from the pagecache:

  • echo 1 > /proc/sys/vm/drop_caches

To free dentries and inodes (cached filesystem metadata):

  • echo 2 > /proc/sys/vm/drop_caches

To free pagecache, dentries and inodes:

  • echo 3 > /proc/sys/vm/drop_caches
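The three variants above can be wrapped into a small root-only helper; a sketch (drop_clean_caches is a hypothetical name):

```shell
# Drop clean pagecache, dentries and inodes. Dirty data is never discarded,
# but syncing first writes dirty pages out so more of the cache is clean.
# Requires root and a kernel with drop_caches support.
drop_clean_caches() {
  sync
  if [ -w /proc/sys/vm/drop_caches ]; then
    echo 3 > /proc/sys/vm/drop_caches   # 1=pagecache, 2=dentries/inodes, 3=both
  else
    echo "drop_caches not writable (need root, or kernel lacks support)" >&2
    return 1
  fi
}
```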

The drop_caches feature was mainlined in the Linux 2.6.16 kernel and thus has been around since about 2006. Using drop_caches should in no way impact dirty data. As such, this should be a totally benign operation. In my experience, it is. I do it all the time with Oracle Database instances running, and even while transactions are running. I don’t do it for reasons any production database administrator would, but I do it nonetheless. I was, therefore, very surprised to receive feedback from a reader who tried this experiment on an Enterprise Linux 4.8 system. Now, I was surprised to hear that drop_caches was even available on OEL 4.8 (or RHEL 4.8 for that matter) since those are 2.6.9-based. I have not touched a RHEL4 or EL4 system in quite some time, but it looks like drop_caches was back-ported from 2.6.16 to the RHEL4 kernel at some point. The reader was testing on OEL version 4.8. His comment went as follows:

Here’s an interesting outcome. On a 4.8 OEL system with about 4 databases running (non-prod). I did a sync, then echo 3>/proc/sys/vm/drop_caches. Looked at free and I got the expected result. However, all the DB’s then returned this error when connecting to them:

ERROR:
ORA-01034: ORACLE not available
ORA-27102: out of memory
Linux-x86_64 Error: 12: Cannot allocate memory
Additional information: 1
Additional information: 262149

Interesting, indeed! The reader (Mike) and I tried to investigate this a bit but ran into a wall, since none of his OEL 4.8 systems had the strace utility installed. I cannot imagine what sort of memory allocation error this is, but suffice it to say that freeing up page cache memory resulting in a memory allocation failure is quite an unexpected result.

If anyone has an OEL (or RHEL 4.8) test system with Oracle running and would care to strace for us we’d appreciate it. Simply clear the page cache with the echo into drop_caches and see if you too suffer this connection-time error. If so, simply do it again as follows:

$ strace -o /tmp/bizarre.txt -f sqlplus scott/tiger

I’d love to get my hands on such a bizarre strace file!

Announcement: Hotsos Symposium 2010!

Hotsos Symposium is considered by many to be one of the best Oracle-related educational events each year. This is just a quick blog entry to direct attention to this year’s event. Tom Kyte will deliver this year’s keynote and Tanel Poder is hosting Training Day 2010. In fact, it looks like a lot of OakTable Network members will be there.

The current list of speakers can be found here.

I’ll be speaking at Hotsos Symposium 2010 as well, but I haven’t finished ironing out the content of my session just yet. I have my ideas for content.

Words About Oracle Direct NFS On Sun Storage 7410 And Chip Multithreading Technology (CMT) For Oracle Database 11g Release 2.

My old friend Glenn Fawcett and I have been teaming up on Sun Oracle Database Machine DW/BI/OLTP performance engineering work, but last week Glenn told me he had an opportunity to put a late-model Sun T5240 CMT system through the OLTP wringer. After consulting with me about workload options, I set Glenn up with one of my favorite Pro*C Order Entry OLTP workload kits (no, it’s not TPC-C). We thought it would be quite interesting to compare a T5240 to a Sun Fire 4100 (the compute nodes in a Sun Oracle Database Machine) in a head-to-head test. Nothing official, just pure science. That work is still underway.

For new readers’ sake, I’d like to point out an old blog entry I made (circa 2006) after seeing a lot of TechMarketingBafoonery™ slung about by Sun competitors suggesting the CoolThreads SPARC architecture is unfit for Oracle because it has only a single floating point engine per socket. That blog entry can be found here:

Marketing Efforts Prove Sun Fire T2000 Is Not Fit For Oracle Database Processing

What Storage To Test With?
The storage option that became available to Glenn was a Sun 7000 NAS device (a.k.a., project Fishworks). I recommended Glenn set up Direct NFS in order to efficiently drive the storage. I see that Glenn has posted some Solaris Direct NFS related information as a result of his efforts over the last couple of days. I recommend Glenn’s post about Solaris Direct NFS Configuration and Verification and Glenn’s new fledgling blog in general.

That Oracle Over NFS Stuff Is So Avant Garde
I’d like to point out, for new blog readers, that I have been a long-standing advocate of Oracle over NFS as a storage solution. The following posts are a good place to start if you want some food for thought on why Oracle Direct NFS is a good storage protocol and Oracle over NFS is a good storage architecture in general:

And, CFS, NFS Topics here.

Linux Free Memory: Is It Free Or Reclaimable? Yes. When I Want Free Memory, I Want Free Memory!

I have certain performance test harnesses that expect to have a lot of free memory. The test harnesses produce a lot of output that facilitates my performance analysis efforts. The output of these harnesses is mostly in the form of text files, pipe-delimited files (to be used by SQL*Loader) and CSV files that I upload to Excel.

Sometimes I execute these performance harnesses for long periods of time in which case I can generate a significant amount of file system content. The problem is that I wind up with little or no free memory.

I Want Free Memory, Not Reclaimable Memory
On Linux (as is the case with every Unix derivative) user memory allocations (e.g., stack/heap growth) can be satisfied by the OS through page reclaims. Page reclaims are simply reuse of memory pages that have clean content, such as pages of text files read from the file system. A reclaimable page can be matched to a process request for memory extremely fast, I’ll agree. However, there is still a difference between free memory and reclaimable memory. Whether that difference is any more significant than the output of such tools as free(1) suggests is not the point of this blog entry. In fact, if you want to see the “buffer adjusted” free memory on Linux you can use the -o argument to the free(1) command. The problem for me can sometimes be that I don’t want to munge through scripts and programs that expect to start with a lot of free memory and change them to weed out the reclaimable effect.

If you want to effectively clean out physical memory of all the cached file system dead wood, and you have proper permissions, you can write values into /proc/sys/vm/drop_caches and get the results you are looking for.

In the following example I have a 72GB physical memory system that is totally idle. There is a single instance of Oracle Database 11g Release 2 booted with an SGA of roughly 22GB. There are no significantly large processes running either. I’m basically sitting on about 50 GB of cached file system “stuff” that I don’t want cached. As the example shows I’ve echoed the numeral 3 into drop_caches and the subsequent execution of free(1) shows the 50 GB of cached file system junk is now wiped out and shown under the “free” column.

# free
             total       used       free     shared    buffers     cached
Mem:      74027752   73688248     339504          0     528164   62777200
-/+ buffers/cache:   10382884   63644868
Swap:     16779884     954164   15825720
# echo 3 >/proc/sys/vm/drop_caches
#
# free
             total       used       free     shared    buffers     cached
Mem:      74027752   22956444   51071308          0       2144   13320800
-/+ buffers/cache:    9633500   64394252
Swap:     16779884     954164   15825720
#

Of course I understand that this is just a shuffle in memory accounting by the kernel, but for certain scripting purposes this might be helpful for someone searching the web someday.
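For scripts that merely need the “effectively free” figure rather than a clean slate, an alternative is to compute free-plus-reclaimable directly from /proc/meminfo; a sketch:

```shell
# "Effectively free" memory in kB: MemFree plus clean, reclaimable
# page cache (Buffers and Cached), read straight from /proc/meminfo.
awk '/^(MemFree|Buffers|Cached):/ { sum += $2 } END { print sum " kB effectively free" }' /proc/meminfo
```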

Oracle Database 11g Release 2 Includes The Orion I/O Test Tool, But You Better Get That Full Path Name Right.

As Miladin Modrakovic recently pointed out, Oracle Database 11g Release 2 includes, for the first time ever, a copy of the ORacle IO Numbers (ORION) disk throughput test tool. The executable has been put into $ORACLE_HOME/bin.
This is just a quick blog entry to point out that the executable does work, but I had to file a bug since it only functions if executed with a full path name. At least that is the case on Linux; I can’t try it on Solaris.

This one goes out to the wayward Googler looking for the search terms “error 27155 at location skgpchild2”.

The following shows the diagnostic for why a relative path name fails and a full path name succeeds. When invoked without a full path name, the second invocation of execve(2) (called by a child thread) fails with ENOENT. The last set of strace(1) output shows that the full path name is the fix.


$ uname -s -r -v -m -p -i -o
Linux 2.6.18-128.1.16.0.1.el5 #1 SMP Tue Jun 30 16:48:30 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
$ type orion
orion is hashed (/u01/app/oracle/product/11.2.0/dbhome_1/bin/orion)
$ orion -run normal -testname ext3 -num_disks 1
ORION: ORacle IO Numbers -- Version 11.2.0.1.0
ext3_20091106_1024
Calibration will take approximately 19 minutes.
Using a large value for -cache_size may take longer.

Error in spawning child process
Additional information: : error 27155 at location skgpchild2 - execv() - No such file or directory
orion_main: orion_spawn sml failed
Test aborted due to errors.
$
$ strace -o strace.out -f orion -run normal -testname ext3 -num_disks 1
ORION: ORacle IO Numbers -- Version 11.2.0.1.0
ext3_20091106_1025
Calibration will take approximately 19 minutes.
Using a large value for -cache_size may take longer.

Error in spawning child process
Additional information: : error 27155 at location skgpchild2 - execv() - No such file or directory
orion_main: orion_spawn sml failed
Test aborted due to errors.
$ grep execve strace.out
25824 execve("/u01/app/oracle/product/11.2.0/dbhome_1/bin/orion", ["orion", "-run", "normal", "-testname", "ext3", "-num_disks", "1"], [/* 31 vars */]) = 0
25825 execve("orion", ["orion", "_ranseq", "0", "33010150\370x%;\377\177"], [/* 34 vars */]) = -1 ENOENT (No such file or directory)
$
$ strace -o strace.out -f $ORACLE_HOME/bin/orion -run normal -testname ext3 -num_disks 1
ORION: ORacle IO Numbers -- Version 11.2.0.1.0
ext3_20091106_1026
Calibration will take approximately 19 minutes.
Using a large value for -cache_size may take longer.

$ grep execve  strace.out
26223 execve("/u01/app/oracle/product/11.2.0/dbhome_1/bin/orion", ["/u01/app/oracle/product/11.2.0/d", "-run", "normal", "-testname", "ext3", "-num_disks", "1"], [/* 31 vars */]) = 0
26224 execve("/u01/app/oracle/product/11.2.0/dbhome_1/bin/orion", ["/u01/app/oracle/product/11.2.0/d", "_ranseq", "0", "1771981131\33\351\377\177"], [/* 34 vars */]) = 0
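The root cause is visible in the second execve line of the failing run: execv(2) resolves its path argument literally and does not search $PATH, so a bare "orion" is looked up relative to the current working directory only. A minimal Python sketch of that failure mode (the file name is made up for illustration):

```python
import os

# execv() resolves its first argument as a literal path and does NOT
# search $PATH. A bare name like "orion" is therefore looked up relative
# to the current working directory only -- hence the child's ENOENT above.
try:
    os.execv("no-such-file-anywhere", ["no-such-file-anywhere"])
except FileNotFoundError as e:
    print("errno", e.errno)  # errno 2 is ENOENT
```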

Sun Oracle Database Machine Video Demonstration…Again.

This is just a quick blog entry to point out that the video (available here) I blogged about in my recent post entitled Announcing a YouTube Video Demonstration of Sun Oracle Database Machine is once again online. It had been pulled by the owner for some rough-edge editing.

Sun Oracle Database Machine: The Million Oracle Database IOPS Machine or Marketing Hype? Part II.

In my recent post entitled Sun Oracle Database Machine: The Million Oracle Database IOPS Machine or Marketing Hype? Part I, I started a discussion about why this million IOPS-capable platform is interesting to Oracle deployments that don’t require quite (or anywhere near) that much I/O. Since I, more or less, asserted that only a handful of applications in production require 1 million IOPS, it may seem on the surface as though I am in disagreement with my friend Glenn Fawcett’s post regarding the physical drive savings with Sun Oracle Database Machine. Glenn writes:

Consider for a moment the number of drives necessary to match the 1 million IOPS available in the database machine. Assuming you are using the best 15,000 rpm drive, you would be able to do 250 IOPS/drive. So, to get to 1 million IOPS, you would need 4,000 drives! A highly dense 42U storage rack can house anywhere from 300-400 drives. So, you would need 10 racks, just for the storage and at least one rack for servers.

I agree with Glenn’s math in that post, insofar as Sun Oracle Database Machine can match the read IOPS of the 4,000 drives Glenn speaks of. However, I still hold fast that any production site with 4,000 spindles provisioned specifically to meet an IOPS requirement of a single production database is a rarity. So, does that mean I am dismissing the value of Sun Oracle Database Machine? No.

Consolidation, Again
Both Glenn and I have spoken about applying Sun Oracle Database Machine to database consolidation, and for good reason. I think it would be difficult to win CIO mind share with a million-IOPS value proposition when the same CIO is probably quite aware that his entire enterprise data center IOPS load comes nowhere near that amount. Let me see if I understand Sun Oracle Database Machine, and what goes on in real-life production data centers, well enough to make some sense of what may appear to be an overly capable platform.

Consolidating Chaos Into…Chaos?
The value proposition supporting database consolidation centers on reducing chaos. It all comes down to breaking the relationship of 1 Operating System per Oracle Database. Consider, for example, 32 hypothetical Oracle Database instances deployed in the standard 1:1 (OS:DB) deployment model (let’s leave high availability out of the equation for a moment). These hypothetical database instances would require 32 systems to maintain. What if those systems are, say, 3 years old, each with 2-socket, multi-core processors of that period? It is quite likely that the same 32 databases could be deployed in a single Sun Oracle Database Machine and experience significant performance improvement. What does this have to do with chaos, IOPS and FLASH, you ask?

Real Grid Infrastructure Makes For Good Consolidation
In my view, consolidating those hypothetical 32 instances into a database grid of 8 hosts (all Real Application Clusters-ready) would reduce a lot of chaos because you wouldn’t have 32 pools of storage to manage.

Disk space in the Sun Oracle Database Machine can simply be provisioned from one easily managed storage pool (an Automatic Storage Management disk group). That seems a lot simpler to me, as does maintaining 24 fewer OS images. Other consolidation considerations include the ability to run both RAC and non-RAC instances in the Database Machine. This differs from provisioning discrete systems, some of which are RAC-ready (e.g., shared storage provisioned, interconnects configured, etc.) and others not. With the Database Machine it is a simple task to switch a database from non-RAC to RAC because all the requirements are in place.

Another thing to consider about consolidation of this type is that any of the databases can run on any of the hosts. The hosts serve as a true grid of resources. I know folks speak of grid computing often, but 32 servers with their 32 pools of storage don’t fit the definition of a grid any more than would 32 two-node RAC clusters, each with its own pool of storage. Once the storage is central and shared, and all hosts are interconnected, you have a grid. But what does that have to do with IOPS and FLASH, you ask?

“I Don’t Need One Million IOPS. I Need What I Need When I Need It, But Usually Don’t Get It Anyway”
Let’s say you’re a lot like the hypothetical 32-host, 32-database DBA. Isn’t it quite a task keeping up with which of those databases demand, say, 2,000 IOPS per processor core (e.g., 8,000 for a 4-core server) and which ones are more compute intensive, demanding only 200 IOPS per processor core? So, what do you do? Do you argue your case for pools of “common denominator storage” that can handle the 8,000 IOPS case, or just suffer through with something in the middle? How much waste does that lead to? How poorly under-provisioned are your heavy I/O database instances?

Consolidating databases into a Sun Oracle Database Machine allows IOPS-hungry applications to scale up to 8 nodes and 1 million read IOPS with RAC. Conversely, there are 8 units, each capable of roughly 125,000 IOPS, to be provisioned according to multiple database needs. For instance, several compute-light but IOPS-intensive databases could likely be hosted in a single database server in the Database Machine, since there is a demonstrated 125,000 IOPS worth of bandwidth available to each host. That’s over 15,000 per processor core. Quick, run off to your repository of AWR reports and see how many of your databases demand 15,000+ IOPS per processor core! Now, while I have you thinking about AWR reports, do make sure to ascertain your read:write ratio. As I pointed out in Part I of this series, the datasheets are quite clear that the Sun Oracle Database Machine can service 1,000,000 read IOPS but is limited to 50,000 gross write IOPS in a full-rack configuration. The Exadata Smart Flash Cache adds no value to writes. Also, from that 50,000 write IOPS comes the overhead of mirrored writes, so a full-rack Sun Oracle Database Machine has the capacity to service 25,000 writes per second, a read:write ratio of 40:1.
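The arithmetic in that paragraph, sketched out. The 1,000,000 read IOPS and 50,000 gross write IOPS figures come from the datasheet discussion above; the 8-core-per-host count is my assumption, not a datasheet figure:

```python
# Back-of-envelope IOPS math for a full-rack configuration.
hosts = 8
read_iops = 1_000_000
per_host = read_iops // hosts        # read IOPS available per host
per_core = per_host // 8             # assuming 8 cores per host
net_write_iops = 50_000 // 2         # ASM mirrored writes halve net write IOPS
print(per_host, per_core, read_iops // net_write_iops)
# → 125000 15625 40
```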

IOPS, SchmIOPS. I Care About Latency!
Let’s look at this from a different angle. What if you have a database that doesn’t require extreme read IOPS but requires very low latency reads served from a data set of, say, 1 TB? Imagine further that this latency-sensitive database isn’t the only database you have. Imagine that! Today’s DBA managing more than a single Oracle Database! Well, your world today is full of difficulty. Today you have pools of storage fed to you from the “Storage Group” which may or may not even satisfy the requirements of your run-of-the-mill databases, much less this hypothetical 1 TB latency-sensitive database. What to do? Round it all up and consolidate the whole set of databases into a Database Machine. The lower-class “citizens” can be stacked together inside one or a few database hosts and their I/O demand controlled with Exadata I/O Resource Management (IORM). The latency-sensitive 1 TB database (be it single instance or RAC), on the other hand, can operate entirely out of FLASH because the architecture of Sun Oracle Database Machine is such that the entire aggregate FLASH capacity is available to all databases in the grid. That is, databases don’t have to be scaled with RAC to have access to all FLASH capacity. So that latency-sensitive database can grow to 5 TB and still be covered with FLASH; further, it can grow to become more processor-intensive as well and scale with RAC to multiple instances without procuring a new cluster.
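A rough capacity tally behind the “5 TB and still covered with FLASH” point. The per-cell figures (14 storage cells per full rack, 384 GB of Exadata Smart Flash Cache each) are my recollection of the datasheet, not taken from this post, so verify before relying on them:

```python
# Aggregate flash cache in a hypothetical full-rack configuration.
cells = 14
flash_gb_per_cell = 384          # assumed per-cell flash cache capacity
total_gb = cells * flash_gb_per_cell
print(total_gb, round(total_gb / 1024, 2))  # GB, and binary TB
# → 5376 5.25
```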

As an aside, it is nearly impossible to control the I/O demand of hosted databases in any other consolidation scheme since there is no IORM. In those other schemes you can certainly control the amount of CPU and memory a database is allocated, but it doesn’t take much CPU, or memory, to put significant strain on the central I/O subsystem if things go awry (e.g., runaway queries, application server flooding, etc.). If you don’t know what I’m talking about, simply examine how little CPU something like Orion requires while obliterating an I/O subsystem.

Fixed FLASH Assets Fixes All? Far-Fetched!
There are storage vendors out there stating that you can reach parity with the Database Machine by simply plugging in one of their FLASH devices. I won’t argue that it is possible to work out the system and storage infrastructure necessary for a million IOPS.  That rate is feasible even with Fibre Channel, although it would require some 20 active 4GFC paths (given an Oracle data block of 8 KB) from some number of hosts.
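The 20-path figure works out as follows, assuming roughly 425 MB/s of usable payload per 4GFC path (an approximation; effective throughput varies):

```python
import math

# One million 8 KB reads per second is ~8.2 GB/s of aggregate bandwidth.
iops = 1_000_000
block = 8 * 1024                   # 8 KB Oracle data block
bytes_per_sec = iops * block       # required aggregate bandwidth
path_bytes_per_sec = 425 * 10**6   # ~425 MB/s usable per 4GFC path (assumed)
print(math.ceil(bytes_per_sec / path_bytes_per_sec))  # → 20
```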

No, I’m not as excited about the IOPS capability of the Database Machine as much as the FLASH storage-provisioning model it offers. In spite of all the hype, the Database Machine IOPS story is every bit as much a story of elegance as it is brute force. Allow me to explain.

If you want to apply the power of FLASH with most conventional storage offerings you have to use FLASH as ordinary disks. That requires, for availability’s sake, mirroring. Further, you have to figure out which portions of your database contain the high-I/O-rate objects and commit them permanently to FLASH. No big deal, I suppose, if you choose a 100% FLASH storage approach. I think that is unlikely, though. What most databases have are hot spots, and those must go into FLASH while other parts of the database remain on spinning disk. Well, there is a problem with that. Hot spots move. So now you find yourself shuffling portions of your database in and out of FLASH disks manually. That’s messy at best. What if you have databases that only occasionally go critical? Do you provision them permanent, mirrored FLASH disk as well?

The FLASH cache in Sun Oracle Database Machine is dynamic. Data flows through the cache, where hot stuff stays and cold stuff leaves (based on capacity). What does this have to do with consolidation? Well, some of your databases can be IOPS hungry, others not, and those personalities can shift without notice. That sort of dynamism is a real headache with conventional systems. With the Database Machine, on the other hand, the problem sort of solves itself.
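To illustrate the “hot stuff stays, cold stuff leaves” behavior, here is a toy capacity-bound cache. It is a generic LRU sketch of the general idea, not Oracle’s actual caching algorithm:

```python
from collections import OrderedDict

# Toy model of a capacity-bound cache: recently accessed blocks stay,
# the coldest block is evicted when capacity is exceeded.
class TinyCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.d = OrderedDict()

    def access(self, block):
        if block in self.d:
            self.d.move_to_end(block)       # hot: refresh its position
        else:
            self.d[block] = True
            if len(self.d) > self.capacity:
                self.d.popitem(last=False)  # cold: evict least recently used

c = TinyCache(3)
for b in ["a", "b", "a", "c", "a", "d"]:
    c.access(b)
print(list(c.d))  # → ['c', 'a', 'd']  ("b" went cold and aged out)
```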

There is a reason I use the phrase “Million IOPS Capable.”  I think the term “Million IOPS Machine” presumes too much while insinuating far too little.

That’s my opinion.

HP Proliant DL180 G5 With Nehalem EP and PC2-5300 DDR2 Memory?

Please forgive me…I just can’t resist a tongue-in-cheek post after seeing this. According to this HP webpage for the Proliant DL180 G5, there is a new flavor of server and, for that matter, a new flavor of Nehalem processor. As the following screen shot shows, the G5 now supports 2 Xeon 5500 (Nehalem) processors, and further down the page it suggests the CPUs are fed with up to 6 PC2-5300 DDR2 DIMMs.

All in the name of technohumor:

[Screen shot: HP Proliant DL180 G5 specifications page]


DISCLAIMER

I work for Amazon Web Services. The opinions I share in this blog are my own. I'm *not* communicating as a spokesperson for Amazon. In other words, I work at Amazon, but this is my own opinion.


Copyright

All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.