
Oracle Database 11g Database File System (DBFS). Common DBFS Problems and Solutions. Part II.

I’m finding that one of the more confusing aspects of Oracle Database 11g Database File System (DBFS) is the difference between access and administration permissions.

The Oracle Database 11g Database File System (DBFS) documentation discusses the “fuse” group, the allow_other DBFS mount option and the /etc/fuse.conf file, but none of this information draws a clear distinction between administrative permissions and access permissions.

  • Administrative Permissions. Any mention of the fuse group (and any mention of permission settings on /dev/fuse) relates to the permissions necessary to administer the file system (e.g., mount and unmount file systems). See the sketch just after this list.
  • Access Permissions. The allow_root and allow_other dbfs_client mount options relate to permissions to access the mounted file systems and the contents of those file systems.
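
For the curious, here is what the administrative side typically looks like on a Linux host: membership in the fuse group and, when ordinary (non-root) users will be mounting with allow_other, the user_allow_other directive in /etc/fuse.conf. The following is a minimal sketch; group names and /dev/fuse permissions can vary by distribution:

# usermod -a -G fuse oracle                 # let the oracle user administer FUSE mounts
# grep fuse /etc/group                      # verify the membership (takes effect at next login)
fuse:x:104:oracle
# echo user_allow_other >> /etc/fuse.conf   # required before non-root mounts may use allow_other
# ls -l /dev/fuse                           # the fuse group needs read/write on the device
crw-rw---- 1 root fuse 10, 229 Feb 10 17:00 /dev/fuse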

If the allow_other mount option is passed—when using dbfs_client to mount a file system—then other, ordinary Linux users can access file system contents as per the normal Linux file system permissions. Consider the following example. I’ll show that I have a DBFS file system mounted at /data. Next, as root, I’ll create a file called joe.txt and then modify its ownership to joe and group joe:

# mount | grep /data
dbfs on /data type fuse (rw,nosuid,nodev,max_read=1048576,default_permissions,allow_other,user=oracle)
# ls -ld /data /data/FS1 /data/FS1/joe.txt
drwxr-xr-x  4 root   root 0 Feb 10 17:30 /data
drwxrwxrwx 76 oracle dba  0 Feb 10 17:27 /data/FS1
-rw-r--r--  1 joe    joe  0 Feb 10 17:27 /data/FS1/joe.txt

Next, I’ll switch user (with login) to the user joe and list joe’s file. It is owned by joe and all directory hierarchy permissions back to /data permit joe’s access, so it should be no problem:

# su - joe
$
$ id
uid=1001(joe) gid=1005(joe) groups=1005(joe)
$ ls -l /data/FS1/joe.txt
ls: /data/FS1/joe.txt: Permission denied

OK, so something is obviously wrong. Indeed there is: it’s a known bug (8976604) and there is a very simple work-around. I don’t like blogging about known bugs, but I think someone will likely find this via a Google search. Allow me to explain the work-around.

The work-around for bug 8976604 is to supply the --pass-through-fuse-options parameter at mount time. Consider the following example:

$ $ORACLE_HOME/bin/dbfs_client dbfs@ --pass-through-fuse-options -o allow_other,direct_io /data < /opt/oracle/dbfs/passwd.txt &
[1] 2931
$ Password:
$ su - joe
Password:
$
$ id
uid=1001(joe) gid=1005(joe) groups=1005(joe)
$ ls -l /data/FS1/joe.txt
total 0
-rw-r--r--  1 joe    joe  0 Feb 10 17:27 /data/FS1/joe.txt

There, joe is happy now that he can see the file he owns.

Speaking at Hotsos Symposium 2010.

I haven’t made a blog entry in about two weeks, but that is not due to a lack of topics. I get a constant flow of email from readers with topics they’d like to see covered in this blog.

I’m speaking at Hotsos Symposium 2010 next month and my presentation consists of a deep dive into a lot of the sorts of topics readers ask me to blog about. I’ve posted the abstract below.  The abstract is also posted at the Hotsos Symposium 2010 speaker page.

According to the speaker schedule I’m presenting in a time slot adjacent to Tom Kyte’s session about PL/SQL. So, if you are one of the 3 or so people who, for one bizarre reason or another, decide not to see Tom, perhaps you can attend my session. We’ll all be able to stretch out as there will be plenty of room 🙂

Here’s the abstract:

Ten Years After Y2K And We Still “Party Like It’s 1999”

Whether you call it “the two-thousands”, the “Ohs”, “The Naughties” or “The Aughts”, the first decade of this millennium is over and it ushered in a significant amount of new technology related to Oracle. Don’t be alarmed, this is not one of those worthless technical chronology presentations. After all, is there anyone who isn’t aware that the decade started with the introduction of Real Application Clusters—and thus the demise of the large, central server—and finished with Oracle acquiring Sun Microsystems? This presentation has nothing to do with any of that! In spite of how much technology has changed, we really do still seem to be stuck in the 1990s. The following is a sample of some of the topics I’ll be diving into—deeply.

  • We still think a CPU is a CPU.
  • We still think memory is memory.
  • We still think bulk data loading is a disk-intensive operation.
  • We still think direct-attached storage is for “small” systems.
  • We still think database==structured and file system==unstructured.
  • We still think NUMA is niche technology.
  • We still think NFS is a file serving protocol.

And, of course, I’ll be talking a bit about Oracle Exadata Storage Server and the Oracle Database Machine.

Oracle Database 11g Database File System (DBFS). Common DBFS Problems and Solutions. Part I.

In my recent post entitled Oracle Database 11g Database File System (DBFS) Is Not A Popular Topic. Yet., I discussed how I use Google search terms as a way of monitoring what sorts of problems people are running into with new technology such as DBFS.

This is part one in a series of posts where I’ll offer some tips regarding the more common DBFS-related problems one might encounter in early testing. The first problem I’ll tackle is “Transport endpoint is not connected.”

Error Text: Transport Endpoint is not connected.
This error string occurs when attempting to access a mounted DBFS file system. The error condition arises when there is no dbfs_client currently executing that is associated with the file system.

In the following boxes I’ll show two examples of how this error can be encountered. First, I’ll show that I have a DBFS file system mounted and that an attempt to change directories to the mount point results in “Transport endpoint is not connected”, such as:

$ mount | grep dbfs
dbfs on /data type fuse (rw,nosuid,nodev,max_read=1048576,default_permissions,allow_other,user=oracle)
$ cd /data
-bash: cd: /data: Transport endpoint is not connected

A simple ps command shows there is no dbfs_client process executing:

$ ps -ef | grep dbfs_client
oracle    1694 21941  0 09:49 pts/11   00:00:00 grep dbfs_client

This error condition is quite simple to rectify. You simply need to restart the dbfs_client process:

$ export LD_LIBRARY_PATH=$ORACLE_HOME/lib
$ export ORACLE_SID=DBFS1
$ 
$ $ORACLE_HOME/bin/dbfs_client dbfs@ -o allow_root,direct_io /data < /opt/oracle/dbfs/passwd.txt &
[1] 2501
$ fuse: bad mount point `/data': Transport endpoint is not connected

Sorry, I had to be sneaky in order to show one of the other conditions that raises the error. So, yes, to rectify this error condition you do have to restart dbfs_client; however, you first need to unmount the dead mount. You’ll see in the following box that I use fusermount -u to unmount and then execute dbfs_client, and without any further action I can list the contents of the file system.

$ fusermount -u /data
$ $ORACLE_HOME/bin/dbfs_client dbfs@ -o allow_root,direct_io /data < /opt/oracle/dbfs/passwd.txt &
[1] 2931
$ Password:

$ ls -l /data/FS1/TEST
total 0
-rw-r--r-- 1 oracle oinstall 0 Dec 30 13:39 511b
-rw-r--r-- 1 root root 0 Dec 30 13:40 test.out

Dead Mount, Living DBFS Instance.
If you look in the boxes above you’ll notice that dbfs_client was not executing (the real root of the problem) but the instance was running. Since dbfs_client is an OCI program, all I had to do was kill it to create this scenario. If, however, you suffer this error while dbfs_client is executing, then I recommend involving Support because that would be a bug.
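
Putting it all together, the recovery can be scripted. Here is a minimal sketch based on the examples above; the script name is hypothetical, and the environment settings and paths are from my test system, so adjust to taste:

#!/bin/bash
# remount_dbfs.sh (hypothetical): detect a dead DBFS mount and restart dbfs_client
export LD_LIBRARY_PATH=$ORACLE_HOME/lib
export ORACLE_SID=DBFS1
if ! ls /data > /dev/null 2>&1; then     # a dead mount fails: "Transport endpoint is not connected"
  fusermount -u /data                    # unmount the dead mount point first
  $ORACLE_HOME/bin/dbfs_client dbfs@ -o allow_root,direct_io /data \
    < /opt/oracle/dbfs/passwd.txt &      # restart dbfs_client in the background
fi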

Oracle Database 11g Database File System (DBFS) Is Not A Popular Topic. Yet.

I find that keeping my finger on the blogosphere pulse is a good way to ascertain the adoption rate of new Oracle features such as Oracle Database 11g Database File System (DBFS).  I see that Tim Hall has posted DBFS content in his typically excellent style!

I have a few posts on the topic of DBFS. Amongst the following list is a post (“Hidden Content?”) that will direct you to a webcast I delivered to the IOUG Exadata Special Interest Group. The webcast is very informative because I introduce the concept of “injecting” files into the file system. Injecting? Yes, imagine transferring files into an Ext3 file system on a remote host without that file system even being mounted, anywhere! Cool technology and very efficient.

Google Search Terms
Another web aspect I monitor when it comes to new Oracle features is what search engine terms are being thrown at Google. For example, the following DBFS-related searches are likely search terms for folks who are testing the waters with DBFS and/or suffering problems getting it to work:

Search Term Hits
dbfs_client 370
dbfs_client +allow_root 8
dbfs_client +allow_other 9
DBFS +fusermount +oracle 1100
DBFS +direct_io 246
DBFS +exadata 14,700
–pass-through-fuse-options 0
“Transport endpoint is not connected” +DBFS 0
“fail to connect to database server” 4
“fail to connect to database server” +DBFS 0
dbfs_client +”fuse: failed to exec fusermount: Permission denied” 0
DBFS +”fuse: failed to exec fusermount: Permission denied” 0

I aim to post a few blog entries with troubleshooting tips for some of the more common DBFS-related problems that customers might hit. The first entry I’ll make will cover the more common “Transport endpoint is not connected” error string that is returned under certain situations when trying to access DBFS mounts.

So, yes, the title of this post was a come-on. Oracle Database 11g Database File System is a new feature. It takes time for the blogosphere to catch up. But, as I’ve pointed out, there are good, trustworthy bloggers posting content…like Tim Hall.

EU Clears Oracle to Buy Sun Microsystems

EU Clears Path for Oracle Sun Acquisition (Wall Street Journal).

Little Things Doth Crabby Make – Part XI. Gigabyte or Gibibyte? Apparently Both!

It’s been a while since I’ve posted an installment in my Little Things Doth Crabby Make series. While I’m not feeling particularly crabby, I have to admit that this one just got under my skin today.

Giga, Gibi, Giggly
I’m like most of you when it comes to computer capacity nomenclature. In spite of standards bodies, I just can’t bring myself to think of a gigabyte as 10^9 bytes. The guys that sell memory still sell 2^30 bytes and processor caches are still sized in multiples of 2^20 bytes (a.k.a. old-school megabytes). But, according to standards bodies, a gigabyte is 1000000000 bytes and what I refer to as a gigabyte is actually a gibibyte, or 1073741824 bytes.
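
To put a number on the discrepancy, bc(1) makes the point quickly. A binary gigabyte is roughly 7% larger than a decimal one, which is exactly why dd(1) reports 2^30 bytes as “1.1 GB” in the boxes below:

$ bc -l
2^30
1073741824
10^9
1000000000
2^30 / 10^9
1.07374182400000000000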

Network technology has always used decimal nomenclature (e.g., megabit, gigabit) and since about the late 1990s hard drive manufacturers have done the same. So, when we install something like a 600GB (gigabyte) drive and partition it, we usually see 558 gibibytes, such as with this drive in the Sun Oracle Exadata Storage Server:

# cat /proc/partitions | grep sdh
   8   112  585531392 sdh
# bc -l
bc 1.06
Copyright 1991-1994, 1997, 1998, 2000 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'.
(585531392 * 1024) / (2^30)
558.40625000000000000000

Help Me Please, He’s Boring Me To Death!
I know, and I apologize. But let me just take one more moment of your time to show you what little thing hath crabby made. It’s all about the content in the following box. I tell dd(1) to write to disk exactly 1GB (a.k.a. a “Kevin gigabyte”). That is, I want dd(1) to use a single call to LibC write to fill out 2^30 bytes. That works, you know, as long as you have the address space and physical memory. Anyway, back to the topic. So, I set dd(1) out to create a 2^30 byte, er, I mean 1GB, file and so it did. But it reported its success to me as if it were a real strict IEEE or networking sort of guy. In the box you’ll see that it reported it wrote 1.1GB and it did—1.1 gigabytes.

Sorry, I’m still boring you to death.

So the thing that made me crabby was the fact that I followed up the dd(1) command with an ls(1) command using the Linux -h option which, according to the manpage, reports sizes in “human readable” form. That’s OK with me since to me K, M and G are all powers of two and I’m a human. But I thought we were supposed to start rethinking our way to the decimal nomenclature. Hmmm. See for yourself:

# dd if=/dev/zero of=GB bs=1073741824 count=1
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 2.4089 seconds, 446 MB/s
# ls -lh GB
-rw-r--r-- 1 root root 1.0G Jan  6 14:48 GB

Nah, that didn’t really make me that crabby. I do get tired of the decimal thing though. But worse than that is the schizophrenia dd(1) exhibits. The following is what really made me crabby:

$ dd if=/dev/zero of=1G bs=1G count=1
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 1.19684 seconds, 897 MB/s

Now that is a good one! It’s sadly schizophrenic but entertaining.

Hot Oracle-Related Search Engine Terms For 2009? Carry On Wayward Googler!

2009 is drawing to a close, so I just took a gander at the year-end search term statistics for my blog. It’s interesting to see the search terms that send readers to this blog. Some are surprising, others not so surprising. So, for trivial pursuit’s sake, the following list shows the top 50 search terms that resulted in a little over 30,000 click-throughs in CY2009. Quasi-interesting.

Search Term Visits
kevin closson 5,612
site:kevinclosson.wordpress.com direct i 2,332
cpuid linux 2,258
svr4 error: 25: inappropriate ioctl for 1,693
oracle exadata 1,460
linux cpuid 1,279
oracle orion 1,276
exadata 1,221
oracle nfs 1,054
oracle 10.2.0.4 970
kevin closson blog 872
gridsql 819
oracle rhel5 704
oracle alert log 588
ges: potential blocker 564
cpuid for linux 515
rhel5 oracle 510
prot-1: failed to initialize ocrconfig 455
sun flashfire 440
hugepages 393
oracle disk manager 393
closson wordpress 382
site:kevinclosson.wordpress.com direct i 377
linux cpu id 371
oracle nfs mount options 319
fibre channel 311
sql server on linux 305
11g alert log 304
cpu id linux 298
nfs performance 295
os watcher 285
oracle secure files 278
sched_setaffinity 278
clusterdeconfig 277
t2000 oracle 275
orion oracle 272
selinux oracle 267
oracle numa 262
oracle blog 262
closson 261
crsd.bin 256
manly men 250
alert log location 242
10.2.0.4 241
solaris cp 237
shared appl_top 217
oracle t2000 217
dbwr 216
oracle on nfs 215
oracle database 10.2.0.4 214

Oracle Database File System (DBFS). It’s Not An “Exadata Feature.” – Part I.

Last summer I posted a blog entry about the Oracle Database 11g feature known as Database File System (DBFS). The blog entry I made on the topic was about how suitable DBFS is as a staging file system for ELT/ETL operations. Since then I’ve also offered a webcast covering some DBFS-related topics. With all that, one would naturally presume DBFS is an Exadata-only feature. It is not. I do, however, get frequent requests for Exadata-specific tuning tips for DBFS. There are none.

Deploying DBFS in an Exadata environment is exactly the same as with non-Exadata storage—just faster.

I do feel that DBFS is much more than just a curiosity. Indeed, a file system stored in a database? Really? Yes! What a twist, though, given the fact that it has always been the other way around!

I aim to make a few blog entries soon to offer some compelling reasons why Oracle shops might do well to look into areas where DBFS can help solve problems. Indeed, others in the blogosphere have caught on to the fact that DBFS is not an Exadata-only feature, as demonstrated by Ronny Egner’s excellent blog post on DBFS.

I’d also like to draw attention to this excellent PowerPoint presentation on DBFS as well.

So, I’m going to conclude with this as Part I in a series where I’ll cover some setup, diagnostic and performance information about DBFS.

More Words About Oracle Direct NFS On Sun Storage 7410 And Chip Multithreading Technology (CMT) For Oracle Database 11g Release 2.

In my recent blog entry entitled Words About Oracle Direct NFS On Sun Storage 7410 And Chip Multithreading Technology (CMT) For Oracle Database 11g Release 2 I discussed how my old friend Glenn Fawcett and I are studying OLTP performance characteristics of dual-socket Xeon 5500 (Nehalem EP) versus the Sun T5240 (also dual-socket). After I set Glenn up with a nice flexible OLTP workload, Glenn started collecting results and analyzing Oracle Direct NFS (DNFS) performance. Now that he has finished testing with DNFS over 10GbE, he posted a fresh blog entry on the matter that can be found at the following link:

The previous installments on that thread can be found here:

Glenn measured some 90,000 random 8KB physical IOPS over 10GbE and over 1GB/s when scanning disk!

As I’ve been saying for quite some time, “NFS, it ain’t just for kids no more!”

Now, just in case anyone thinks we are wasting our time with some sort of “trick” workload, take my word for it—this workload is grueling. Although I haven’t seen the AWR report from the 90,000+ IOPS 10GbE run, I know enough about the T5240 and this workload to guesstimate that the logical I/O rate for SGA buffer hits would have been on the order of 250,000/second and the CPUs were likely about 60% utilized. I know Glenn reads my blog and he’ll see the trackback from this post, so let’s see if he can chime in here and correct me on that LIO, CPU guesstimate…Glenn?

Pardon Me, Where Is That Flash Cache? Part II.

In Part I of this series about Database Smart Flash Cache I laid a bit of groundwork on the topic of Database Flash Cache and how it differs from Exadata Smart Flash Cache. I’d like to continue this series by discussing some of the mechanisms associated with Database Smart Flash Cache. But first, I’d like to offer a little more background in case the architecture of Database Smart Flash Cache is still confusing.

Fast Forward To The Past
During the mid-to-late 1990s, and the early part of this decade, DBAs would routinely find themselves managing 32-bit Oracle databases on systems with very large physical memory and a 64-bit operating system. These systems would support a system buffer cache limited only by physical memory. In order to help reduce physical I/O, administrators would sometimes use buffered file system files for Oracle. Oracle I/Os were therefore DMAed into the system buffer cache and then copied in kernel mode into the address space of the calling Oracle process. Re-reads from the system buffer cache eliminated physical I/Os. This model came with significant penalties, such as the fact that all data in the SGA buffer cache was also present in the system buffer cache (a waste of memory). The memory copy operations from the system buffer cache into the SGA buffers (for every physical I/O) were quite a load on systems of the time period as well.

While working in Sequent Database Engineering in the 1990s, I was one of the folks pushing for a model quite similar to how Database Flash Cache works to address Oracle caching needs. Instead of files on Flash, large “buffer cache files” were to reside in main memory. The proposal was such that when a buffer aged from the SGA buffer pool it would be written to the “buffer cache file.” Disk reads would occur directly into the SGA. If data happened to be in the “buffer cache file” it would be copied from RAM (with a mapping operation), thus eliminating a physical I/O. That would have been much better than the double-buffering effect of using buffered file system files. As it turns out, it would have also been quite similar to the Database Smart Flash Cache architecture.

None of that ever came to pass because we devised and implemented Indirect Data Buffers as a much better approach than leveraging the system buffer cache or any other such external caching. However, that didn’t keep me from pursuing NUMA-optimized disk caching technology, as my NUMA-optimized external cache patent sort of suggests.

What does this have to do with Database Smart Flash Cache?

Is that a Logical Read or a Physical Read? Yes!
The real goal of Database Smart Flash Cache is to reduce physical disk reads. It can’t do anything for writes. Most OLTP/ERP applications do more reading than writing so I think this is OK. The interesting turn on this is that physical disk reads from Database Smart Flash Cache occur in the logical I/O code path.

A logical I/O that hits a buffer previously aged to the Flash Cache requires a physical read to bring the buffer back into the address space of the process. Say what? This technology puts a physical read in the logical read code path? Yes.

Think of it this way. If a process doesn’t find the block it needs in the buffer cache, in a non-Database Smart Flash Cache scenario, it has to perform a read from disk. In the case of Database Smart Flash Cache there is a miss in the first level SGA buffer pool (aka L1 SGA) followed by a hit in the second level SGA buffer pool and at that point there is a physical disk read from Flash. The read from Flash will be quite fast and in the case of a PCI Flash card there won’t even be interaction with a central disk controller since each PCI Flash card has its own built-in disk controller.

In the next installment in this series of blog posts I will cover a performance comparison of Database Flash Cache. However, I need to end this installment with a quick look into how the instance accesses the Database Smart Flash Cache file.

It’s Cache, But It’s A File
In the following box I’ll show the initialization parameters and DBWR open files for an instance that is using a Database Smart Flash Cache file.


SQL> show parameter db_flash

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
db_flash_cache_file                  string      /home/cache/scratchfile
db_flash_cache_size                  big integer 31G

# ps -ef | grep dbw
oracle    6466     1  0 Dec07 ?        00:00:35 ora_dbw0_OLTP1
oracle    6468     1  0 Dec07 ?        00:00:39 ora_dbw1_OLTP1
oracle   10428     1  0 Oct20 ?        00:02:59 asm_dbw0_+ASM1
oracle   11733 26379  0 08:17 pts/12   00:00:00 grep dbw
# ls -l /proc/6466/fd | grep file
lrwx------ 1 oracle oinstall 64 Dec  7 19:47 22 -> /home/cache/scratchfile
lrwx------ 1 oracle oinstall 64 Dec  7 19:47 33 -> /home/cache/scratchfile

OK, so that showed us that the DBWR process for the OLTP1 instance has two file descriptors open on the Database Smart Flash Cache file. After starting an OLTP workload, I straced DBWR to see how it was interacting with the Flash Cache file:


strace -p 6466 2>&1 | head -10
Process 6466 attached - interrupt to quit
pwrite(33, "\6\242\0\0\345]\0\0HP\316\252\0\0\2\4\371\376\0\0\1\0\0\0\345E\1\0\226L\316\252"..., 8192, 29339877376) = 8192
pwrite(33, "\6\242\0\0004o\0\0\247l\316\252\0\0\2\4\2404\0\0\2\0\0\0\351E\1\0005l\316\252"..., 8192, 29339811840) = 8192
pwrite(33, "\6\242\0\0\344\27\0\0_M\316\252\0\0\2\4t\262\0\0\1\0\0\0\345E\1\0\226L\316\252"..., 8192, 29339795456) = 8192
pwrite(33, "\6\242\0\0\2202\3\0\221R\316\252\0\0\2\4\330\244\0\0\1\0\0\0\340E\1\0_L\316\252"..., 8192, 29338681344) = 8192
pwrite(33, "\6\242\0\0\310S\0\0\17O\316\252\0\0\2\4\262z\0\0\1\0\0\0\342E\1\0\345K\316\252"..., 8192, 29338583040) = 8192
pwrite(33, "\6\242\0\0[\236\7\0GU\316\252\0\0\2\4.\301\0\0\1\0\0\0\340E\1\0_L\316\252"..., 8192, 29338492928) = 8192
pwrite(33, "\6\242\0\0\371\277\2\0\364]\316\252\0\0\2\4\371\315\0\0\1\0\0\0\342E\1\0\345K\316\252"..., 8192, 29336969216) = 8192
pwrite(33, "\6\242\0\0\271\353\0\0\332e\316\252\0\0\2\4\220\262\0\0\2\0\0\0\350E\1\0\247e\316\252"..., 8192, 29336190976) = 8192
pwrite(33, "\6\242\0\0003F\v\0\227[\316\252\0\0\2\4~*\0\0\1\0\0\0\340E\1\0_L\316\252"..., 8192, 31513272320) = 8192

Well, that’s not good! DBWR is performing synchronous writes to the Flash Cache file. No mystery, I forgot to set filesystemio_options=setall. After doing so, DBWR uses libaio calls to spill into the Flash Cache file (the 5th word in each iocb is the file descriptor). The following snippet shows a blast of 165 writes to the Flash Cache file:


io_submit(47959748861952, 165, {{0x2b9e7fb02630, 0, 1, 0, 33}, {0x2b9e7faa3c70, 0, 1, 0, 33}, {0x2b9e7fa545f8, 0, 1, 0, 33}, {0x2b9e7fb91a70, 0, 1, 0, 33},
{0x2b9e7fa276f0, 0, 1, 0, 33}, {0x2b9e7faf68f8, 0, 1, 0, 33}, {0x2b9e7fafcfb0, 0, 1, 0, 33}, {0x2b9e7fc09f10, 0, 1, 0, 33}, {0x2b9e7fb8b920, 0, 1, 0, 33},
{0x2b9e7fa91880, 0, 1, 0, 33}, {0x2b9e7fb05170, 0, 1, 0, 33}, {0x2b9e7fb80c20, 0, 1, 0, 33}, {0x2b9e7faa6d18, 0, 1, 0, 33}, {0x2b9e7faba140, 0, 1, 0, 33},
[ .. many lines deleted ...]
{0x2b9e7fbb6d20, 0, 1, 0, 33}, {0x2b9e7faa1c00, 0, 1, 0, 33}, {0x2b9e7faeabc0, 0, 1, 0, 33}, {0x2b9e7fa7f490, 0, 1, 0, 33}, {0x2b9e7fb8a380, 0, 1, 0, 33},
{0x2b9e7fad5c90, 0, 1, 0, 33}}) = 165
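
As an aside, filesystemio_options is a static parameter, so the fix has to go into the spfile followed by an instance bounce. A minimal sketch, assuming an spfile-based instance:

$ sqlplus / as sysdba <<EOF
ALTER SYSTEM SET filesystemio_options='SETALL' SCOPE=SPFILE;
SHUTDOWN IMMEDIATE
STARTUP
EOF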

So we’ve seen the mechanisms used by DBWR to spill into the Flash Cache file. What about foreground processes? The following shows a shadow process performing 8KB synchronous reads from the Flash Cache file. These pread() calls are physical I/O, and yet they are actually in the extended code path of logical I/O. We are talking about cache hits, after all, and a hit in the Flash Cache is an SGA LIO.

$ ls -l /proc/32719/fd | grep scratch
lrwx------ 1 oracle oinstall 64 Dec 10 09:19 14 -> /home/cache/scratchfile
[oracle@dscgif05 dbs]$ strace -p 32719 2>&1 | grep pread | head -20
pread(14, "\6\242\0\0\256\215\1\0\337e\316\252\0\0\2\4\304\34\0\0\2\0\0\0\350E\1\0\316e\316\252"..., 8192, 23248191488) = 8192
pread(14, "\6\242\0\0\204\31\6\0\366\367\\\253\0\0\1\4z\321\0\0\1\0\0\0\340E\1\0\366\367\\\253"..., 8192, 17962737664) = 8192
pread(14, "\6\242\0\0\202:\1\0\366S\316\252\0\0\2\4\25\301\0\0\1\0\0\0\342E\1\0\345K\316\252"..., 8192, 32084803584) = 8192
pread(14, "\6\242\0\0\204V\5\0\346\272\\\253\0\0\1\4\207\247\0\0\1\0\0\0\340E\1\0\346\272\\\253"..., 8192, 27747041280) = 8192
pread(14, "\6\242\0\0\337\305\0\0\337e\316\252\0\0\2\4ey\0\0\2\0\0\0\350E\1\0ne\316\252"..., 8192, 30495244288) = 8192
pread(14, "\6\242\0\0= \5\0\2035h\253\0\0\1\4)\223\0\0\1\0\0\0\340E\1\0\2035h\253"..., 8192, 20019929088) = 8192
pread(14, "\6\242\0\0\223\v\1\0,T\337\252\0\0\1\6\316\207\0\0\1\0\0\0\342E\1\0+T\337\252"..., 8192, 28805251072) = 8192
pread(14, "\6\242\0\0\214\256\0\0\332e\316\252\0\0\2\4\5a\0\0\2\0\0\0\350E\1\0ne\316\252"..., 8192, 31322808320) = 8192
pread(14, "\6\242\0\0\241w\4\0#\232m\253\0\0\1\4\334A\0\0\1\0\0\0\340E\1\0#\232m\253"..., 8192, 31416205312) = 8192
pread(14, "\6\242\0\0\304<\1\0\374S\316\252\0\0\2\4>\257\0\0\1\0\0\0\342E\1\0\345K\316\252"..., 8192, 24838955008) = 8192
pread(14, "\6\242\0\0`\201\1\0\332e\316\252\0\0\2\4=*\0\0\2\0\0\0\350E\1\0\316e\316\252"..., 8192, 21697290240) = 8192
pread(14, "\6\242\0\0z\300\5\0\341Tl\253\0\0\1\4\365\315\0\0\1\0\0\0\340E\1\0\341Tl\253"..., 8192, 21481119744) = 8192
pread(14, "\6\242\0\0O\243\1\0\365e\316\252\0\0\2\4\261!\0\0\2\0\0\0\350E\1\0Ue\316\252"..., 8192, 22630391808) = 8192
pread(14, "\6\242\0\0\221Z\r\0\240B]\253\0\0\1\4zK\0\0\1\0\0\0\340E\1\0\240B]\253"..., 8192, 32498204672) = 8192
pread(14, "\6\242\0\0\222Z\r\0\t\335g\253\0\0\1\4$\316\0\0\1\0\0\0\340E\1\0\t\335g\253"..., 8192, 25871138816) = 8192
pread(14, "\6\242\0\0o\245\1\0\365e\316\252\0\0\2\4s\272\0\0\2\0\0\0\350E\1\0Ue\316\252"..., 8192, 30909734912) = 8192
pread(14, "\6\242\0\0\373i\r\0\211\10b\253\0\0\1\4.q\0\0\1\0\0\0\30E\1\0\211\10b\253"..., 8192, 27650859008) = 8192
pread(14, "\6\242\0\0\374i\r\0kTc\253\0\0\1\4v0\0\0\1\0\0\0\340E\1\0kTc\253"..., 8192, 18883878912) = 8192
pread(14, "\6\242\0\0t\357\2\0\2678\344\252\0\0\1\6R3\0\0\1\0\0\0\342E\1\0\345K\316\252"..., 8192, 23501406208) = 8192
pread(14, "\6\242\0\0003\247\1\0\370e\316\252\0\0\2\4\314\35\0\0\2\0\0\0\350E\1\0Ue\316\252"..., 8192, 27493490688) = 8192
$ ps -ef | grep 32719 | grep -v grep
oracle   32719 32700  4 09:17 ?        00:00:05 oracleOLTP1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))

Summary
This post in the series aimed to show some of the mechanisms involved when processes interact with the Database Smart Flash Cache file. In Part III I’ll be sharing some performance numbers.

Pardon Me, Where Is That Flash Cache? Part I.

Flash is Flash is Flash, Right?

I was recently catching up on my reading of Jonathan Lewis’ blog when I read his post about Flash Cache. Jonathan’s post contained a few comments on the matter and then refers to Guy Harrison’s post on the topic.

After reading both of those posts I am now convinced more than ever that there is significant confusion over Flash Cache where Oracle Database 11g Release 2 is concerned. Both Jonathan and Guy state that Database Flash Cache (a.k.a. Database Smart Flash Cache) is an Exadata-only feature liberated for use in non-Exadata environments. I need to point out that this is not correct. However, when extremely smart folks are getting it wrong I think it is safe to say that there must not be enough information available.

Distant Cousins

From the initial release of Oracle Database 11g Release 2, the Database Flash Cache feature has had nothing in common with Exadata Smart Flash Cache other than that Flash technology happens to be at the very center of both. In fact, it would be quite difficult for them to be any more dissimilar than they are. Put simply:

  • Database Flash Cache is an extension of the SGA that resides in Flash. Only data buffered in the SGA can make its way into the Database Flash Cache. The data is essentially “aged” out and written by DBWR into the Flash Cache device file.
  • Exadata Smart Flash Cache is PCI Flash in Exadata Storage Server used as an intelligent, adaptive cache. That is to say that regardless of whether the primary buffering of data is in the SGA or PGA, Exadata Smart Flash Cache sits downwind, caching data based upon demand.

The Oracle Database 11g New Features list introduces Database Flash Cache as follows:

Database Smart Flash Cache

Database Smart Flash Cache is an optional memory component that you can add if your database is running on Solaris or Oracle Enterprise Linux. It is an extension of the SGA-resident buffer cache, providing a level 2 cache for database blocks. It can improve response time and overall throughput.
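
Enabling the feature amounts to two initialization parameters: db_flash_cache_file points the instance at a file (or device) on Flash and db_flash_cache_size sizes the cache. A minimal sketch, assuming an spfile-based instance; the path and size here are merely illustrative, and since db_flash_cache_file is a static parameter a restart is required:

$ sqlplus / as sysdba <<EOF
ALTER SYSTEM SET db_flash_cache_file='/home/cache/scratchfile' SCOPE=SPFILE;
ALTER SYSTEM SET db_flash_cache_size=31G SCOPE=SPFILE;
SHUTDOWN IMMEDIATE
STARTUP
EOF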

As for Exadata Smart Flash Cache, there is a good white paper on the topic available here.

Part II can be found here.

Out of Memory Errors With Oracle Database On RHEL 4.8 Or OEL 4.8? Freeing Memory Perhaps?

In my recent post about free versus reclaimable memory on Linux (Linux Free Memory, Is It Free Or Reclaimable?) I discussed a technique to quickly shuffle all clean, buffered file system pages to the free list. If you are interested, please see that post for an explanation of why I would mess with free versus reclaimable pages. The following is a quick reminder of what /proc/sys/vm/drop_caches is used for (a combined sketch follows the list):

To clear clean pages from the pagecache:

  • echo 1 > /proc/sys/vm/drop_caches

To free dentries and inodes (cached filesystem metadata):

  • echo 2 > /proc/sys/vm/drop_caches

To free pagecache, dentries and inodes:

  • echo 3 > /proc/sys/vm/drop_caches
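
Putting those to use, the typical sequence is to flush dirty data first with sync and then drop the clean pages. A quick sketch (as root):

# sync                                # flush dirty pages to disk first
# echo 3 > /proc/sys/vm/drop_caches   # free pagecache, dentries and inodes
# free                                # the reclaimed pages now show under the free column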

The drop_caches feature was mainlined in the Linux 2.6.16 kernel and thus has been around since about 2006. Using drop_caches should in no way impact dirty data. As such, this should be a totally benign operation. In my experience, it is. I do it all the time with Oracle Database instances running and even while transactions are running. I don’t do it for reasons any production Database Administrator would, but I do it nonetheless. I was, therefore, very surprised to receive feedback from a reader who tried this experiment on an Enterprise Linux 4 system. Now, I was surprised to hear that drop_caches was even available on OEL 4.8 (or RHEL 4.8 for that matter) since it is 2.6.9-based. I have not touched a RHEL4 or EL4 system in quite some time, but it looks like drop_caches was back-ported from 2.6.16 to the RHEL4 kernel at some point. The reader was testing on OEL version 4.8. His comment went as follows:

Here’s an interesting outcome. On a 4.8 OEL system with about 4 databases running (non-prod). I did a sync, then echo 3>/proc/sys/vm/drop_caches. Looked at free and I got the expected result. However, all the DB’s then returned this error when connecting to them:

ERROR:
ORA-01034: ORACLE not available
ORA-27102: out of memory
Linux-x86_64 Error: 12: Cannot allocate memory
Additional information: 1
Additional information: 262149

Interesting, indeed! The reader (Mike) and I tried to investigate this a bit but ran into a wall since none of his OEL 4.8 systems had the strace utility installed. I cannot imagine what sort of memory allocation error this is, but suffice it to say that freeing up memory resulting in a memory allocation problem is quite an unexpected result.

If anyone has an OEL 4.8 (or RHEL 4.8) test system with Oracle running and would care to strace for us, we’d appreciate it. Simply clear the page cache with the echo into drop_caches and see if you too suffer this connection-time error. If so, simply reproduce it under strace as follows:

$ strace -o /tmp/bizarre.txt -f sqlplus scott/tiger

I’d love to get my hands on such a bizarre strace file!

Announcement: Hotsos Symposium 2010!

Hotsos Symposium is considered by many to be one of the best Oracle-related educational events each year. This is just a quick blog entry to direct attention to this year’s event. Tom Kyte will deliver this year’s keynote and Tanel Poder is hosting Training Day 2010. In fact, it looks like a lot of OakTable Network Members will be there.

The current list of speakers can be found here.

I’ll be speaking at Hotsos Symposium 2010 as well, but I haven’t finished ironing out the content of my session just yet. I have my ideas for content.

Words About Oracle Direct NFS On Sun Storage 7410 And Chip Multithreading Technology (CMT) For Oracle Database 11g Release 2.

My old friend Glenn Fawcett and I have been teaming up on Sun Oracle Database Machine DW/BI/OLTP performance engineering work, but last week Glenn told me he had an opportunity to put a late model Sun T5240 CMT system through the OLTP wringer. After consulting with me about workload options I set Glenn up with one of my favorite Pro*C Order Entry OLTP workload kits (no, it’s not TPC-C). We thought it would be quite interesting to compare a T5240 to a Sun Fire 4100 (the compute nodes in a Sun Oracle Database Machine) in a head-to-head test. Nothing official, just pure science. That work is still underway.

For new readers’ sake, I’d like to point out an old blog entry I made (circa 2006) after seeing a lot of TechMarketingBafoonery™ slung about by Sun competitors suggesting the CoolThreads SPARC architecture is unfit for Oracle because it has only a single floating point engine per socket. That blog entry can be found here:

Marketing Efforts Prove Sun Fire T2000 Is Not Fit For Oracle Database Processing

What Storage To Test With?
The storage option that became available to Glenn was a Sun 7000 NAS device (a.k.a. project Fishworks). I recommended Glenn set up Direct NFS in order to efficiently drive the storage. I see that Glenn has posted some Solaris Direct NFS related information as a result of his efforts over the last couple of days. I recommend Glenn’s post about Solaris Direct NFS Configuration and Verification and Glenn’s new fledgling blog in general.
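
As a side note, a quick way to confirm an instance is actually using Direct NFS (rather than the kernel NFS client) is to query the DNFS views once the instance has files open on the filer. A minimal sketch, assuming Oracle Database 11g:

$ sqlplus / as sysdba <<EOF
SELECT svrname, dirname FROM v\$dnfs_servers;
SELECT COUNT(*) FROM v\$dnfs_files;
EOF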

That Oracle Over NFS Stuff Is So Avant Garde
I’d like to point out, for new blog readers, that I have been a long-standing advocate of Oracle over NFS as a storage solution. The following posts are a good place to start if you want some food for thought on why Oracle Direct NFS is a good storage protocol and Oracle over NFS is a good storage architecture in general:

And, CFS, NFS Topics here.

Linux Free Memory: Is It Free Or Reclaimable? Yes. When I Want Free Memory, I Want Free Memory!

I have certain performance test harnesses that expect to have a lot of free memory. The test harnesses produce a lot of output that facilitates my performance analysis efforts. The output of these harnesses is mostly in the form of text files, pipe-delimited files (to be used by SQL*Loader) and CSV files which I upload to Excel.

Sometimes I execute these performance harnesses for long periods of time in which case I can generate a significant amount of file system content. The problem is that I wind up with little or no free memory.

I Want Free Memory, Not Reclaimable Memory
On Linux (as is the case with every Unix derivation) user memory allocations (e.g., stack/heap growth) can be satisfied by the OS through page reclaims. Page reclaims are simply reuse of memory pages that have clean content, such as pages of text files read from the file system. A reclaimable page can be matched to a process request for memory extremely fast, I’ll agree. However, there is still a difference between free memory and reclaimable memory. Whether that difference is any more significant than the output of such tools as free(1) suggests is not the point of this blog entry. In fact, if you want to see the buffer-adjusted free memory on Linux, look at the “-/+ buffers/cache” line of free(1) output. The problem for me can sometimes be that I don’t want to munge through scripts and programs that expect to start with a lot of free memory and change them to weed out the reclaimable effect.
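
For scripting purposes, the distinction matters when parsing the numbers directly. A trivial sketch of the sort of check a harness might perform; the threshold is just an example:

free_kb=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)   # truly free pages, not reclaimable ones
if [ "$free_kb" -lt 50331648 ]; then                    # require roughly 48GB free before a run
  echo "not enough free memory" >&2
fi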

If you want to effectively clean out physical memory of all the cached file system dead wood, and you have proper permissions, you can write values into /proc/sys/vm/drop_caches and get the results you are looking for.

In the following example I have a 72GB physical memory system that is totally idle. There is a single instance of Oracle Database 11g Release 2 booted with an SGA of roughly 22GB. There are no significantly large processes running either. I’m basically sitting on about 50 GB of cached file system “stuff” that I don’t want cached. As the example shows I’ve echoed the numeral 3 into drop_caches and the subsequent execution of free(1) shows the 50 GB of cached file system junk is now wiped out and shown under the “free” column.

# free
             total       used       free     shared    buffers     cached
Mem:      74027752   73688248     339504          0     528164   62777200
-/+ buffers/cache:   10382884   63644868
Swap:     16779884     954164   15825720
# echo 3 >/proc/sys/vm/drop_caches
#
# free
             total       used       free     shared    buffers     cached
Mem:      74027752   22956444   51071308          0       2144   13320800
-/+ buffers/cache:    9633500   64394252
Swap:     16779884     954164   15825720
#

Of course I understand that this is just a shuffle in memory accounting by the kernel, but for certain scripting purposes this might be helpful for someone searching the web someday.

