In my recent post entitled Something to Ponder? What Sort of Powerful Offering Could a Filesystem in Userspace Be?, I threw out what may have seemed to be a totally hypothetical Teaser Post™. However, as our HP Oracle Exadata Storage Server and HP Oracle Database Machine customers know from their latest software upgrade, a FUSE-based, Oracle-backed file system is a reality. It is called Oracle Database File System (DBFS). DBFS is one of the cornerstones of the data loading infrastructure in the HP Oracle Database Machine environment. For the time being it serves as a staging area for flat files to be accessed as external tables. The “back-end”, as it were, is not totally new. See, DBFS is built upon Oracle SecureFiles. FUSE is the presentation layer that makes for mountable file systems. Combining FUSE with DBFS yields a distributed, coherent file system that scales thanks to Real Application Clusters. This is a file system that is completely managed by Database Administrators.
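For those curious about the plumbing, here is a minimal sketch of how a DBFS file system is typically created and mounted. The user, password, tablespace name, connect string, and password file are purely illustrative; I'm assuming the dbfs_create_filesystem.sql script and the dbfs_client utility that ship with the database.

$ sqlplus dbfs_user/dbfs_passwd <<EOF
-- Create a DBFS file system named FS1 in a tablespace set aside for it
-- (user, password, and tablespace names here are illustrative)
@?/rdbms/admin/dbfs_create_filesystem.sql DBFS_TS FS1
EOF
$
$ # Mount it through FUSE; dbfs_client reads the database password on stdin
$ nohup dbfs_client dbfs_user@ORCL -o allow_other -o direct_io /data < passwd.txt &

Once mounted, /data behaves like any other file system as far as tools such as cp, dd, and md5sum are concerned, while the bits themselves live in SecureFiles inside the database.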
So, I’m sure some folks’ eyes are rolling back in their heads, wondering why we need YAFS (Yet Another File System). Well, as time progresses I think Oracle enthusiasts will come to see just how feature-rich something like DBFS really is.
If it performs, is feature-rich, and is incremental to existing technology, it sounds awfully good to me!
I’ll be discussing DBFS in my Oracle Exadata Technical Deep Dive – Part IV session tomorrow.
Here is a quick snippet of DBFS in action. In the following box you’ll see a DBFS mount of type fuse on /data and the listing of a file called all_card_trans.ul.
$ mount | grep fuse
dbfs on /data type fuse (rw,nosuid,nodev,max_read=1048576,default_permissions,allow_other,user=oracle)
$ pwd
/data/FS1/stage1
$ ls -l all_card_trans.ul
-rw-r--r-- 1 oracle dba 34034910300 Jun 15 15:30 all_card_trans.ul
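Because /data is just another mounted file system as far as the operating system is concerned, staging data onto it is nothing exotic. The following is a hypothetical example of pulling a flat file from a provider host into the DBFS staging directory; the host name provider1 and the source path are made up for illustration.

$ cd /data/FS1/stage1
$ # Copy a flat file from the (hypothetical) provider system into the DBFS staging area
$ scp provider1:/export/feeds/all_card_trans.ul .
$ ls -l all_card_trans.ul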
In the next box you’ll see ssh used to visit 4 of the servers in an HP Oracle Database Machine and run md5sum against the DBFS-resident file, validating that every node sees the same file.
$ for n in r1 r2 r3 r4
> do
> ssh $n md5sum `pwd`/all_card_trans.ul &
> done
[1] 3943
[2] 3945
[3] 3946
[4] 3947
$ 1adbff1a36a42253c453c22dd031b48b  /data/FS1/stage1/all_card_trans.ul
1adbff1a36a42253c453c22dd031b48b  /data/FS1/stage1/all_card_trans.ul
1adbff1a36a42253c453c22dd031b48b  /data/FS1/stage1/all_card_trans.ul
1adbff1a36a42253c453c22dd031b48b  /data/FS1/stage1/all_card_trans.ul
[1]   Done                    ssh $n md5sum `pwd`/all_card_trans.ul
[2]   Done                    ssh $n md5sum `pwd`/all_card_trans.ul
[3]   Done                    ssh $n md5sum `pwd`/all_card_trans.ul
[4]   Done                    ssh $n md5sum `pwd`/all_card_trans.ul
In the next box you’ll see concurrent multi-node throughput. I’ll use one dd process on each of 4 servers in the HP Oracle Database Machine, each sequentially reading the contents of the same DBFS-based file, achieving 876 MB/s aggregate throughput. And, no, there is no cache involved.
$ for n in r1 r2 r3 r4; do ssh $n time dd if=`pwd`/all_card_trans.ul of=/dev/null bs=1M & done
[1] 13325
[2] 13326
[3] 13327
[4] 13328
$ 32458+1 records in
32458+1 records out
34034910300 bytes (34 GB) copied, 154.117 seconds, 221 MB/s

real    2m34.127s
user    0m0.014s
sys     0m3.073s
32458+1 records in
32458+1 records out
34034910300 bytes (34 GB) copied, 155.113 seconds, 219 MB/s

real    2m35.123s
user    0m0.020s
sys     0m3.127s
32458+1 records in
32458+1 records out
34034910300 bytes (34 GB) copied, 155.813 seconds, 218 MB/s

real    2m35.821s
user    0m0.026s
sys     0m3.210s
32458+1 records in
32458+1 records out
34034910300 bytes (34 GB) copied, 155.89 seconds, 218 MB/s

real    2m35.901s
user    0m0.017s
sys     0m3.039s
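For the curious, the aggregate figure is simply the sum of the four dd rates: 221 + 219 + 218 + 218 MB/s = 876 MB/s.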
With Exadata in mind, the idea is to offer a comprehensive solution for data warehousing. All too often I see data loading claims that start with the flat files sort of magically appearing, ready to be loaded. Oh no, we don’t think that way. The data lives out on a provider system somewhere and has to be staged in advance of ETL/ELT. Since DBFS exploits the insane bandwidth of Exadata, it makes an extremely good data staging solution. The data has to be rapidly ingested into the staging area and then rapidly loaded; fall short on either half of that equation and it becomes your weakest link.
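To make the “accessed as external tables” point concrete, here is a rough sketch of what the load side could look like once a flat file is sitting on the DBFS mount. The schema, directory object, table, and column names are illustrative, not taken from the system shown above.

$ sqlplus scott/tiger <<EOF
-- Point a directory object at the DBFS staging directory (names are illustrative)
CREATE OR REPLACE DIRECTORY stage_dir AS '/data/FS1/stage1';

-- Define the staged flat file as an external table
CREATE TABLE all_card_trans_ext (
  card_no   VARCHAR2(20),
  trans_dt  VARCHAR2(20),
  trans_amt NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY stage_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ','
  )
  LOCATION ('all_card_trans.ul')
)
REJECT LIMIT UNLIMITED
PARALLEL;

-- Load it with a parallel, direct-path CTAS
CREATE TABLE all_card_trans PARALLEL NOLOGGING
AS SELECT * FROM all_card_trans_ext;
EOF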
Just think, no external systems required for data staging. No additional storage connectivity, administration, tuning, etc.
And, yes, it can do more than a single dd process on each node! Much more.