Archive for the 'oracle' Category

Focusing on Ext4 and XFS TRIM Operations – Part I.

I’ve been doing some testing that requires rather large file systems. I have an EMC XtremIO Dual X-Brick array from which I provision a 10 terabyte volume. Volumes in XtremIO are always thinly provisioned. The testing I’m doing required me to scrutinize default Linux mkfs(8) behavior for both Ext4 and XFS. This is part 1 in a short series and it is about Ext4.

Discard the Discard Option

The first thing I noticed in this testing was the fantastical “throughput” demonstrated at the array while running the mkfs(8) command with the “-t ext4” option/arg pair. As the following screen shot shows the “throughput” at the array level was just shy of 72GB/s.

That’s not real I/O…I’ll explain…

EMC XtremIO Dual X-Brick Array During Ext4 mkfs(8). Default Options.

EMC XtremIO Dual X-Brick Array During Ext4 mkfs(8). Default Options.

The default options for Ext4 include the discard (TRIM under the covers) option. The mkfs(8) manpage has this to say about the discard option :

Attempt to discard blocks at mkfs time (discarding blocks initially is useful on solid state devices and sparse / thin-provisioned storage). When the device advertises that discard also zeroes data (any subsequent read after the discard and before write returns zero), then mark all not-yet-zeroed inode tables as zeroed. This significantly speeds up filesystem initialization. This is set as default.

I’ve read that quoted text at least eleventeen times but the wording still sounds like gibberish-scented gobbledygook to me–well, except for the bit about significantly speeding up filesystem initialization.

Since XtremIO volumes are created thin I don’t see any reason for mkfs to take action to make it, what, thinner?  Please let me share test results challenging the assertion that the discard mkfs option results in faster file system initialization. This is the default functionality after all.

In the following terminal output you’ll see that the default mkfs options take 152 seconds to make a file system on a freshly-created 10TB XtremIO volume:


# time mkfs -t ext4 /dev/xtremio/fs/test
mke2fs 1.43-WIP (20-Jun-2013)
Discarding device blocks: done
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=2 blocks, Stripe width=16 blocks
335544320 inodes, 2684354560 blocks
134217728 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
81920 block groups
32768 blocks per group, 32768 fragments per group
4096 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
2560000000

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done
real 2m32.055s
user 0m3.648s
sys 0m17.280s
#

The mkfs(8) Command Without Default Discard Functionality

Please bear in mind that default 152 second result is not due to languishing on pathetic physical I/O. The storage is fast. Please consider the following terminal output where I passed in the non-default -E option with the nodiscard argument. The file system creation took 4.8 seconds:

# time mkfs -t ext4 -E nodiscard /dev/xtremio/fs/test
mke2fs 1.43-WIP (20-Jun-2013)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=2 blocks, Stripe width=16 blocks
335544320 inodes, 2684354560 blocks
134217728 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
81920 block groups
32768 blocks per group, 32768 fragments per group
4096 inodes per group
Superblock backups stored on blocks:
 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
 102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
 2560000000

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done 

real 0m4.856s
user 0m4.264s
sys 0m0.415s
#

I think 152 seconds down to 4.8 makes the point that with proper, thinly-provisioned storage the mkfs discard option does not “significantly speed up filesystem initialization.” But initializing file systems is not something one does frequently so investigation into the discard mount(8) option was in order.

Taking Ext4 For A Drive

Since I had this 10TB Ext4 file system–and a fresh focus on file system discard (storage TRIM) features–I thought I’d take it for a drive.

Discarded the Default Discard But Added The Non-Default Discard

While the default mkfs(8) command includes discard, the mount(8) command does not. I decided to investigate this option while unlinking a reasonable number of large files. To do so I ran a simple script (shown below) that copies 64 files of 16 gigabytes each–in parallel–into the Ext4 file system. I then timed a single invocation of the rm(1) command to remove all 64 of these files. Unlinking file in a Linux file system is a metadata operation, however, when the discard option is used to mount the file system each unlink operation includes TRIM operations being sent to storage. The following screen shot of the XtremIO performance dashboard was taken while the rm(1) command was running. The discard mount option turns a metadata operation into a rather costly storage operation.

Array Level Activity During Bulk rm(1) Command Processing. Ext4 (discard mount option)

Array Level Activity During Bulk rm(1) Command Processing. Ext4 (discard mount option)

The following terminal output shows the test step sequence used to test the discard mount option:

# umount /mnt ; mkfs -t ext4 -E nodiscard /dev/xtremio/fs/test; mount -t ext4 -o discard /dev/xtremio/fs/test /mnt
mke2fs 1.43-WIP (20-Jun-2013)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=2 blocks, Stripe width=16 blocks
335544320 inodes, 2684354560 blocks
134217728 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
81920 block groups
32768 blocks per group, 32768 fragments per group
4096 inodes per group
Superblock backups stored on blocks:
 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
 102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
 2560000000

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done 

# cd mnt
# cat > cpit
for i in {1..64}; do ( dd if=/data1/tape of=file$i bs=1M oflag=direct )& done
wait
# time sh ./cpit > /dev/null 2>&1 

real 5m31.530s
user 0m2.906s
sys 8m45.292s
# du -sh .
1018G .
# time rm -f file*

real 4m52.608s
user 0m0.000s
sys 0m0.497s
#

The following terminal output shows the same test repeated with the file system being mounted with the default (thus no discard) mount options:

# cd ..
# umount /mnt ; mkfs -t ext4 -E nodiscard /dev/xtremio/fs/test; mount -t ext4 /dev/xtremio/fs/test /mnt
mke2fs 1.43-WIP (20-Jun-2013)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=2 blocks, Stripe width=16 blocks
335544320 inodes, 2684354560 blocks
134217728 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
81920 block groups
32768 blocks per group, 32768 fragments per group
4096 inodes per group
Superblock backups stored on blocks:
 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
 102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
 2560000000

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done 

# cd mnt
# cat > cpit
for i in {1..64}; do ( dd if=/data1/tape of=file$i bs=1M oflag=direct )& done
wait
#
# time sh ./cpit > /dev/null 2>&1 

real 5m31.526s
user 0m2.957s
sys 8m50.317s
# time rm -f file*

real 0m16.398s
user 0m0.001s
sys 0m0.750s
#

This testing shows that mounting an Ext4 file system with the discard mount option dramatically impacts file removal operations. The default mount options (thus no discard option) performed the rm(1) command in 16 seconds whereas the same test took 292 seconds when mounted with the discard mount option.

So how can one perform the important house-cleaning that comes with TRIM operations?

The fstrim(8) Command

Ext4 supports user-invoked, online TRIM operations on mounted file systems. I would advise people to forego the discard mount option and opt for occasionally running the fstrim(8) command. The following is an example of  how long it takes to execute fstrim on the same 10TB file system stored in an EMC XtremIO array. I think that foregoing the taxation of commands like rm(1) is a good thing–especially since running fstrim is allowed on mounted file systems and only takes roughly 11 minutes on a 10TB file system.

# time fstrim -v /mnt
/mnt: 10908310835200 bytes were trimmed

real 11m29.325s
user 0m0.000s
sys 2m31.370s
#

Summary

If you use thinly-provisioned storage and want file deletion in Ext4 to return space to the array you have a choice. You can choose to take serious performance hits when you create the file system (default mkfs(8) options) and when you delete files (optional discard mount(8) option) or you can occasionally execute the fstrim(8) command on a mounted file system.

Up Next

The next post in this series will focus on XFS.

Announcing “SLOB Recipes”

I’ve started updating the SLOB Resources page with links to “recipes” for certain SLOB testing. The first installment is the recipe for loading 8TB scale SLOB 2.3 Multiple Schema Model with a 2-Socket Linux host attached to EMC XtremIO. Recipes will include (at a minimum) the relevant SLOB program output (e.g., setup.sh or runit.sh), init.ora and slob.conf.

Please keep an eye on the SLOB Resources page for updates…and don’t miss the first installment. It’s quite interesting.

SLOB-recipes

This Is Not Glossy Marketing But You Still Won’t Believe Your Eyes. EMC XtremIO 4.0 Snapshot Refresh For Agile Test / Dev Storage Provisioning in Oracle Database Environments.

This is just a quick blog post to direct readers to a YouTube video I recently created to help explain to someone how flexible EMC XtremIO Snapshots are. The power of this array capability is probably most appreciated in the realm of provisioning storage for Test and Development environments.

Although this is a silent motion picture I think it will speak volumes–or at least 1,000 words.

Please note: This is just a video demonstration to show the base mechanisms and how they relate to Oracle Database with Automatic Storage Management. This is not a scale demonstration. XtremIO snapshots are supported to in the thousands and extremely powerful “sibling trees” are fully supported.

Not Your Father’s Snapshot Technology

No storage array on the market is as flexible as XtremIO in the area of writable snapshots. This video demonstration shows how snapshots allow the administrator of a “DEV” host–using Oracle ASM–to quickly refresh to current or past versions of ASM disk group contents from the “PROD” environment.

The principles involved in this demonstration are:

  1. XtremIO snapshots are crash consistent.
  2. XtremIO snapshots are immediately created, writeable and space efficient. There is no fixed “donor” relationship. Snapshots can be created from other snapshots and refreshes can go in any direction.
  3. XtremIO snapshot refresh does not involve the host operating system. Snapshot and volume contents can be immediately “swapped” (refreshed) at the array level without any action on the host.

Regarding number 3 on that list, I’ll point out that while the operating system does not play a role in the snapshot operations per se, applications will be sensitive to contents of storage immediately changing. It is only for this reason that there are any host actions at all.

Are Host Operations Involved? Crash Consistent Does Not Mean Application-Coherent

The act of refreshing XtremIO snapshots does not change the SCSI WWN information so hosts do not have any way of knowing the contents of a LUN have changed. In the Oracle Database use case the following must be considered:

  1. With a file system based database one must unmount the file systems before refreshing a snapshot otherwise the file system will be corrupted. This should not alarm anyone. A snapshot refresh is an instantaneous content replacement at the array level. Operationally speaking, file system based databases only require database instance shutdown and the unmounting of the file system in preparation for application-coherent snapshot refresh.
  2. With an ASM based database one must dismount the ASM disk group in preparation for snapshot refresh. To that end, ASM database snapshot restore does not involve system administration in any way.

The video is 5 minutes long and it will show you the following happenings along a timeline:

  1. “PROD” and “DEV” database hosts (one physical and one virtual) each showing the same Oracle database (identical DBID) and database creation time as per dictionary views. This establishes the “donor”<->clone relationship. DEV is a snapshot of PROD. It is begat of a snapshot of a PROD consistency group
  2. A single-row token table called  “test” in the PROD database has value “1.” The DEV database does not even have the token table (DEV is independent of PROD…it’s been changing..but its origins are rooted in PROD as per point #1)
  3. At approximately 41 seconds into the video I take a snapshot of the PROD consistency group with “value 1” in the token table. This step prepares for “time travel” later in the demonstration
  4. I then update the PROD token table to contain the value “42”
  5. At ~2:02 into the video I have already dismounted DEV ASM disk groups and started clobbering DEV with the current state of PROD via a snapshot refresh. This is “catching up to PROD”
    1. Please note: No action at all was needed on the PROD side. The refresh of DEV from PROD is a logical, crash-consistent point in time image
  6. At ~2:53 into the video you’ll see that the DEV database instance has already been booted and that it has value “42” (step #4). This means DEV has “caught up to PROD”
  7. At ~3:32 you’ll see that I use dd(1) to copy the redo LUN over the data LUN on the DEV host to introduce ASM-level corruption
  8. At 3:57 the DEV database is shown as corrupted. In actuality, the ASM disk group holding the DEV database is corrupted
  9. In order to demonstrate traveling back in time, and to recover from the dd(1) corrupting of the ASM disk group,  you’ll see at 4:31 I chose to refresh from the snapshot I took at step #3
  10. At 5:11 you’ll see that DEV has healed from the dd(1) destruction of the ASM disk group, the database instance is booted, and the value in the token table is reverted to 1 (step #3) thus DEV has traveled back in time

Please note: In the YouTube box you can click to view full screen or on youtube.com if the video quality is a problem:

More Information

For information on the fundamentals of EMC XtremIO snapshot technology please refer to the following EMC paper: The fundamentals of XtremIO snapshot technology

For independent validation of XtremIO snapshot technology in a highly-virtualized environment with Oracle Database 12c please click on the following link: Principled Technologies, Inc Whitepaper

For a proven solution whitepaper showing massive scale data sharing with XtremIO snapshots please click on the following link: EMC Whitepaper on massive scale database consolidation via XtremIO

Announcing SLOB 2.3. Tarry Not, Get It While It’s Hot!

BLOG UPDATE 2015.07.16: SLOB 2.3.0.3-1 is now the current version.

This is just a quick post to announce SLOB 2.3. Please visit the SLOB Resources page to download the gzipped tar archive. The SLOB Resources page also has a link the SLOB 2.3 Documentation. SLOB Resources Page: Click Here. New in this release:

  1. The documentation is now also included in the tar archive under SLOB/doc in PDF form.
  2. SLOB 2.3 introduces the SLOB Single Schema feature. Please see the documentation.
  3. Because of SLOB Single Schema the kit now supports SLOB Threads. Note, however, SLOB Threads can be used in either Single or Multiple Schema Model.
  4. SLOB 2.3 has two types of “Hot Spots”
    1. In Multiple Schema Model there are both per-schema Hot Spots and a Hot Schema. Please see the SLOB 2.3 documentation for descriptions of these features.
  5. Improved error handling for both the SLOB Data Loader (setup.sh) and Test Execution program (runit.sh).
  6. Licensing. Prior releases of SLOB consisted of copyrighted programs with unclear licensing. Please don’t be alarmed. SLOB is still free to use. The LICENSE file defines the word “use.”

SLOB 2.3 User Guide

SLOB 2.3 is releasing within the next 48 hours. In case anyone wants to read about all the new features here is a link to the SLOB 2.3 User Guide:

SLOB 2.3 User Guide (pdf)

 

SLOB 2.3 Is Getting Close!

SLOB 2.3 is soon to be released. This version has a lot of new, important features but also a significant amount of tuning in the data loading kit. Before sharing where the progress is on that front, I’ll quickly list some of the new important features that will be in SLOB 2.3:

  1. Single Schema Support. SLOB historically avoids application-level contention by having database sessions perform the SLOB workload against a private schema. The idea behind SLOB is to exert maximum I/O pressure on storage while utilizing the minimum amount of host CPU possible. This lowers the barrier to entry for proper testing as one doesn’t require dozens of processors festering in transactional SQL code just to perform physical I/O. That said, there are cases where a single, large active data set is desirable–if not preferred. SLOB 2.3 allows one to load massive data sets quickly and run large numbers of SLOB threads (database sessions) to drive up the load on the system.
  2. Advanced Hot Spot Testing. SLOB 2.3 supports configuring each SLOB thread such that every Nth SQL statement operates on a hot spot sized in megabytes as specified in the slob.conf file. Moreover, this version of SLOB allows one to dictate the offset for the hot spot within the active data set. This allows one to easily move the hot spot from one test execution to the next. This sort of testing is crucial for platform experts studying hybrid storage arrays that identify and promote “hot” data into flash for example.
  3. Threaded SLOB. SLOB 2.3 allows one to have either multiple SLOB schemas or the new Single Schema and to drive up the load one can specify how many SLOB threads per schema will be active.

 

To close out this short blog entry I’ll make note that the SLOB 2.3 data loader is now loading 1TB scale Single Schema in just short of one hour (55.9 minutes exactly). This procedure includes data loading, index creation and CBO statistics gathering. The following was achieved with a moderate IVB-EP 2s20c40t server running Oracle Linux 6.5 and Oracle Database 12c and connected to an EMC XtremIO array via 8GFC Fibre Channel. I think this shows that even the data loader of SLOB is a worthwhile workload in its own right.

SLOB 2.3 Data Loading 1TB/h

Lab Report: Oracle Database on EMC XtremIO. A Compression Technology Case Study.

If you are interested in array-level data reduction services and how such technology mixes with Oracle Database application-level compression (such as Advanced Compression Option), I offer the link below to an EMC Lab Report on this very topic.

To read the entire Lab Report please click the following link:   Click Here.

The following is an excerpt from the Lab Report:

Executive Summary
EMC XtremIO storage array offers powerful data reduction features. In addition to thin provisioning, XtremIO applies both deduplication and compression algorithms to blocks of data when they are ingested into the array. These features are always on and intrinsic to the array. There is no added licensing, no tuning nor configuration involved when it comes to XtremIO data reduction.

Oracle Database also supports compression. The most common form of Oracle Database compression is the Advanced Compression Option—commonly referred to as ACO. With Oracle Database most “options” are separately licensed features and ACO is one such option. As of the publication date of this Lab Report, ACO is licensed at $11,000 per processor core on the database host1. Compressing Oracle Database blocks with ACO can offer benefits beyond simple storage savings. Blocks compressed with ACO remain compressed as they pass through the database host. In short, blocks compressed with ACO will hold more rows of data per block. This can be either a blessing or a curse. Allowing Oracle to store more rows per block has the positive benefit of caching more application data in main memory (i.e., the Oracle SGA buffer pool). On the other hand, compacting more data into each block often results in increased block-contention.

Oracle offers tuning advice to address this contention in My Oracle Support note 1223705.12. However, the tuning recommendations for reducing block contention with ACO also lower the compression ratios. Oracle also warns users to expect higher CPU overhead with ACO as per the following statement in the Oracle Database product documentation:

Compression technology uses CPU. Ensure that you have enough available CPU to handle the additional load.

Application vendors, such as SAP, also produce literature to further assist database administrators in making sensible choices about how and when to employ Advanced Compression Option. The importance of understanding the possible performance impact of ACO are made quite clear in such publications as SAP Note 14363524 which states the following about SAP performance with ACO:

Overall system throughput is not negatively impacted and may improve. Should you experience very long runtimes (i.e. 5-10 times slower) for certain operations (like mass inserts in BW PSA or ODS tables/partitions) then you should set the event 10447 level 50 in the spfile/init.ora. This will reduce the overhead for insertion into compressed tables/partitions.

The SAP note offers further words of caution regarding transaction logging (a.k.a., redo) in the following quote:

Amount of redo data generated can be up to 30% higher

Oracle Database Administrators, with prior ACO experience, are largely aware of the trade-offs where ACO is concerned. Database Administrators who have customarily used ACO in their Oracle Database deployments may wish to continue to use ACO after adopting EMC XtremIO. For this reason Database Administrators are interested in learning how XtremIO compression and Advanced Compression Option interact.

This Lab Report offers an analysis of space savings with and without ACO on XtremIO. In addition, a performance characterization of an OLTP workload manipulating the same application data in ACO and non-ACO tablespaces will be covered…please click the link above to continue reading…

 


EMC Employee Disclaimer

The opinions and interests expressed on EMC employee blogs are the employees' own and do not necessarily represent EMC's positions, strategies or views. EMC makes no representation or warranties about employee blogs or the accuracy or reliability of such blogs. When you access employee blogs, even though they may contain the EMC logo and content regarding EMC products and services, employee blogs are independent of EMC and EMC does not control their content or operation. In addition, a link to a blog does not mean that EMC endorses that blog or has responsibility for its content or use.

This disclaimer was put into place on March 23, 2011.

Enter your email address to follow this blog and receive notifications of new posts by email.

Join 2,445 other followers

Oracle ACE Program Status

Click It

website metrics

Fond Memories

Copyright

All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2013. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.

Follow

Get every new post delivered to your Inbox.

Join 2,445 other followers