Disk Is Dead? Says Who?

INTRODUCTION

Before we begin, I want to state right up front that I am not anti-flash, nor am I anti-hardware. I work for DataCore Software, which has spent nearly two decades mastering the ability to exploit hardware capabilities for the sole purpose of driving storage I/O. Our software needs hardware, and hardware needs software to instruct it to do something useful. However, over the last year I have read a lot of commentary about how disk (a.k.a. HDDs or magnetic media) is dead. Colorful metaphors such as “spinning rust” are used to describe the apparent death of the HDD market, but is this really the case?

According to a report from TrendFocus, the number of drives shipped in 2015 declined by 16.9% (to 469 million units); however, the amount of capacity shipped increased by more than 30% (to 538 exabytes, or 538,000 petabytes, or 538,000,000 terabytes). In other words, a lot of HDD capacity.

Please note, however, that this is NEW capacity added to the industry on top of the already mind-blowing amount of existing capacity in the field today (estimated at over 10 zettabytes, or 10,000 exabytes, or, well, you get the idea). Eric Brewer, VP of Infrastructure at Google, recently said,

“YouTube users are uploading one petabyte every day, and at current growth rates they should be uploading 10 petabytes per day by 2021.”

The capacity trend certainly doesn’t show signs of slowing, which is why new and improved ways of increasing HDD density are emerging (such as helium-filled drives, HAMR, and SMR). With these new manufacturing techniques, HDD capacities are expected to reach 20TB+ by 2020.

So, I wouldn’t exactly say disk (HDD) is dead, at least from a capacity demand perspective, but it does raise some interesting questions about the ecosystem of drive technology. Perhaps the conclusion that disk is dead is based on drive performance. There is no doubt a battle is raging in the industry. On one side we have HDD, on the other SSD (or flash). Both have advantages and disadvantages, but must we choose between one or the other? Is it all or nothing?

MOVE TO FLASH NOW OR THE SKY WILL FALL

In addition to the commentary about disk being dead, I have seen an equal amount of commentary about how the industry needs to adopt all-flash tomorrow or the world will come to an end (a slight exaggeration, perhaps). This is simply an impossible proposition. According to a past Gartner report,

“it will be physically impossible to manufacture a sufficient number of SSDs to replace the existing HDD install base and produce enough to cater for the extra storage growth.”

Even displacing 20% of the forecasted growth is a near impossibility. And I will take this one step further: not only is it impossible, it is completely unnecessary. However, none of this implies HDD and SSD cannot coexist in peace; they certainly can. What is needed is exactly what Gartner said in the same report,

“ensure that your choice of system and management software will allow for seamless integration and intelligent tiering of data among disparate devices.”

Gartner made this statement because they know only a small percentage of an organization’s data footprint benefits from residing on high-performance media.

THE SOLUTION TO THE PROBLEM IS SOFTWARE

One of the many things DataCore accomplishes with the hardware it manages is optimizing the placement of data across storage devices with varying performance characteristics. This feature is known as auto-tiering, and DataCore performs it automatically across any storage vendor or device type, whether flash- or disk-based.

Over the last six years, DataCore has proven with its auto-tiering capability that only 3-5% of the data within most organizations benefits from high-performance disk (the percentage is even smaller when you understand how DataCore’s Parallel I/O and cache work, but we will touch on this later). Put another way, 95% of an organization’s I/O demand occurs within 3-5% of the data footprint.

While the 3-5% data range doesn’t radically change from day to day, the data contained within that range does. The job of DataCore’s auto-tiering engine is to ensure the right data is on the right disk at the right time in order to deliver the right performance level at the lowest cost. No need to wait, schedule, or perform any manual steps. By the way, the full name of DataCore’s auto-tiering feature is: fully automated, sub-LUN, real-time, read and write-aware, heterogeneous auto-tiering. Not exactly a marketing-friendly name, but there it is.
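To make the idea concrete, here is a minimal conceptual sketch of heat-based, sub-LUN tiering in Python. It is an illustration of the general technique only, not DataCore’s engine; the chunk granularity, the 5% fast-tier fraction, and all names are my own assumptions.

    from collections import Counter

    heat = Counter()   # chunk id -> recent access count ("heat")

    def record_access(chunk):
        heat[chunk] += 1

    def retier(chunks, fast_tier_fraction=0.05):
        # Promote the hottest ~5% of chunks to the fast tier; everything else
        # stays on capacity-oriented disk. A real engine would do this
        # continuously and at sub-LUN granularity.
        n_fast = max(1, int(len(chunks) * fast_tier_fraction))
        ranked = sorted(chunks, key=lambda c: heat[c], reverse=True)
        return set(ranked[:n_fast]), set(ranked[n_fast:])

    chunks = list(range(1000))
    for _ in range(50):              # a skewed workload hammers a few chunks
        for hot in (3, 7, 42):
            record_access(hot)

    fast, slow = retier(chunks)
    assert {3, 7, 42} <= fast        # the hot data lands on the fast tier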

WAIT A SECOND, I THOUGHT THIS WAS ABOUT DISK, NOT FLASH

While DataCore can use flash technologies like any other disk, it doesn’t require them. To prove the point, I will show you a very simple test I performed to demonstrate the impact just a little bit of software can have on the overall performance of a system. If you need a more comprehensive analysis of DataCore’s performance, please see the Storage Performance Council’s website.

In this test I have a single 2U Dell PowerEdge R730 server. This server has two H730P RAID controllers installed. One RAID controller has five 15k drives attached to it forming a RAID-0 disk group (read and write cache enabled). This RAID-0 volume is presented to Windows and is designated as the R: drive.

The other RAID controller is running in HBA mode (non-RAID mode) with another set of five 15k drives attached to it (no cache enabled). These five drives reside in a DataCore disk pool. A single virtual disk is created from this pool matching the size of the RAID-0 volume coming from the other RAID controller. This virtual disk is presented to Windows and is designated as the S: drive.

[Figure: The first set of physical disks forming the RAID-0 volume, as seen in the OpenManage Server Administrator interface]

[Figure: The second set of physical disks and disk pool, as seen from within the DataCore Management Console]

[Figure: The logical volumes R: and S: as seen by the Windows operating system]

DRIVERS, START YOUR ENGINES

I am going to run an I/O generator tool from Microsoft called DiskSpd (the successor to the older SQLIO tool) against these two volumes simultaneously and compare the results using Windows Performance Monitor. The parameters for each test are identical: 8K block size, 100% random, 80% read, 20% write, 10 concurrent threads, and 8 outstanding I/Os against a 10GB test file.

[Figure: DiskSpd test parameters for each logical volume]

The first command on line 2 is running against the RAID-0 disk (R:) and the second command on line 5 is running against the DataCore virtual disk (S:). In addition to having no cache enabled on the HBA connecting the physical disks presented to DataCore within the pool, the DataCore virtual disk also has its write-cache disabled (or write-through enabled). Only DataCore read cache is enabled here.
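For readers who want to try something similar, here is a rough sketch of how those parameters map onto DiskSpd’s command-line switches, launched from Python so both runs start together. The file name and the 60-second duration are my own assumptions; the exact switches used in the screenshot above may differ.

    import subprocess

    # Approximate mapping of the stated test to DiskSpd switches: -b8K = 8K
    # blocks, -r = random, -w20 = 20% writes (therefore 80% reads), -t10 = 10
    # threads, -o8 = 8 outstanding I/Os per thread, -c10G = 10GB test file,
    # -d60 = run for 60 seconds (an assumed duration).
    ARGS = ["diskspd.exe", "-b8K", "-r", "-w20", "-t10", "-o8", "-c10G", "-d60"]

    # Launch both tests simultaneously: one against the RAID-0 volume (R:),
    # one against the DataCore virtual disk (S:).
    procs = [subprocess.Popen(ARGS + [target])
             for target in (r"R:\iotest.dat", r"S:\iotest.dat")]
    for p in procs:
        p.wait()   # block until both runs complete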

[Figure: Write-cache disabled on the DataCore virtual disk]

[Figure: Performance view of the RAID-0 disk]

[Figure: Performance view of the DataCore virtual disk]

As you can see from the performance monitor view, the disk being presented from DataCore is accepting over 26x more I/O per second on average (@146k IOps) than the disk from the RAID controller (@5.4k IOps) for the exact same test. How is this possible?

This is made possible by DataCore’s read cache and the many I/O optimization techniques DataCore uses to accelerate storage I/O throughout the entire stack. For much more detail on these mechanisms, please see my article on Parallel Storage.

In addition to Parallel I/O processing, I am using another nifty feature called the Random Write Accelerator. This feature eliminates the seek time associated with random writes (operations which cause lots of actuator movement on the HDD). DataCore doesn’t pass I/O to the underlying disks in the same pattern the application issues it. By the time the I/O reaches the disks in the pool, the pattern is much more orderly and is therefore received more efficiently by the disks.

So now, as any good engineer would do, I’m going to turn it up a notch and see what this single set of five physical so-called “dead disks” can do. I will now test using five 50GB virtual disks. Remember, these virtual disks come from a DataCore disk pool which contains five 15k non-RAID’d disks. Let’s see what happens.

[Figure: DiskSpd test parameters for five DataCore virtual disks]

The commands on lines 8-12 are running against the five DataCore virtual disks. Below are the results of the testing.

[Figure: Performance view of the five DataCore virtual disks]

Note that nothing has changed at the physical disk layer. The change is simply an increase in the number of virtual disks reading from and writing to the disk pool, which in turn has increased the degree of parallelism in the system. This test shows that, for the same physical disks, we have achieved greater than a 63x performance increase on average (@344k IOps), with bursts well over 400k IOps. This test is throwing 70-80,000 write I/Os per second at physical disks which are only rated to deliver 900 random writes per second combined. This is made possible by sequentializing the random writes before they reach the physical disks, thereby eliminating most of the actuator movement on the HDDs. Without adding any flash to the system, the software has effectively delivered greater than flash-like performance with only five 15k disks in use.

One more important note: this demonstration is certainly not representative of the most you can get out of a DataCore configuration. On the latest SPC-1 run, where DataCore set the world record for all-out performance, DataCore reached 5.12 million SPC-1 IOPS with only two engines (and the CPUs on those engines were only 50% utilized).

CONCLUSION

There are two things happening in the storage industry which have caused a lot of confusion. The first is a lack of awareness of the distinction between I/O parallelization and device parallelization. DataCore has definitively proven its I/O parallelization technique is superior in performance, cost, and efficiency. Flash is a form of device parallelization and can only improve system performance to a point. Device parallelization without I/O parallelization will not take us where the industry is demanding we go (see my article on Parallel Storage).

The second is a narrative being pushed on the industry which says “disk is dead” (likely due to my first concluding point). The demonstration above proves spinning disk is very much alive. Someone may argue that I’m using a flash-like device in the form of RAM to serve as cache. Yes, RAM is a solid-state device (a device electronic in nature), but it is not exotic, it has superior performance characteristics, and organizations already have tons of it sitting in very powerful multiprocessor servers within their infrastructures right now. They simply need the right software to unlock its power.

Insert DataCore’s software layer between the disk and the application and immediately unbind the application from traditional storage hardware limitations.

Parallel Application Meets Parallel Storage

INTRODUCTION

A shift in the computer industry has occurred. Did you notice it? It wasn’t a shift that happened yesterday or even the day before, but rather 11 years ago. The year was 2005, and Moore’s Law as we know it deviated from the path it had traveled for over 35 years. Up until this point in history, improved processor performance was mainly due to frequency scaling, but when core speeds reached ~3.8GHz, the situation quickly became cost prohibitive due to the physics involved in pushing beyond this barrier (factors such as core current, voltage, heat dissipation, structural integrity of the transistors, etc.). Thus, processor manufacturers (and Moore’s Law) were forced to take a different path. This was the dawning of the massive symmetrical multiprocessing era (or what we refer to today as ‘multicore’).

The shift to superscalar symmetrical multiprocessing (SMP) architectures required a specialized skill set in parallel programming in order to fully realize the performance increase across the numerous processor resources. It was no longer enough to simply rely on frequency scaling to improve application response times and throughput. Interestingly, today, more than a decade later, a severe gap persists in our ability to harness the power of multicore, mainly due either to a lack of understanding of parallel programming or to the inherent difficulty of porting a well-established application framework to a parallel programming construct. Perhaps virtualization is also responsible for some of the gap, since the entire concept of virtualization (specifically compute virtualization) is to create many independent virtual machines, each of which can run the same application simultaneously and independently. Within this framework, the demand for parallelism at the application level may have diminished, since the parallelism is handled by the abstraction layer and scheduler within the compute hypervisor (and is no longer as necessary for the application developer; I’m just speculating here). So, while databases and hypervisors are largely rooted in parallelism, there is one massive area that still suffers from a lack of parallelism, and that is storage.

THE PARALLEL STORAGE REVOLUTION BEGINS

In 1998, DataCore Software began work on a framework specifically intended for driving storage I/O. This framework would become known as a storage hypervisor. At the time, the best multiprocessor systems that were commercially available were multi-socket single-core systems (2 or 4 sockets per server). From 1998 to 2005, DataCore perfected the method of harnessing the full potential of common x86 SMP architectures with the sole purpose of driving high-performance storage I/O. For the first time, the storage industry had a portable software-based storage controller technology that was not coupled to a proprietary hardware frame.

In 2005, when multicore processors arrived in the x86 market, an intersection formed between multicore processing and increasingly parallel applications such as VMware’s hypervisor and parallel database engines such as Microsoft SQL Server and Oracle. Enterprise applications slowly became more and more parallel, while surprisingly, the storage subsystems that supported these applications remained largely serial.

MEANWHILE, IN SERIAL-LAND

The serial nature of storage subsystems did not go unnoticed, at least by storage manufacturers. It was well understood that at the current rate of increase in processor density coupled with wider adoption of virtualization technologies (which drove much higher I/O demand density per system), a change was needed at the storage layer to keep up with increased workloads.

In order to overcome the obvious serial limitation in storage I/O processing, the industry had to decide how to go parallel. At the time, the path of least resistance was to simply make disks faster; or, taken from another perspective, to make solid state disks, which by 2005 had been around in some form for over 30 years, more affordable and denser.

As it turns out, the path of least resistance was chosen, either because alternative methods of storage I/O parallelization were unrealized or perhaps because the storage industry was unwilling to completely recode its already highly complex storage subsystem programming. The chosen technique, referred to as [Hardware] Device Parallelization, is now used by every major storage vendor in the industry. The only problem with it is that it doesn’t address the fundamental problem of storage performance, which is latency.

Chris Mellor from The Register wrote recently in an article, “The entire recent investment in developing all-flash arrays could have been avoided simply by parallelizing server IO and populating the servers with SSDs.”

TODAY’S STORAGE SYSTEMS HAVE A FATAL FLAW

There is one fatal flaw in modern storage subsystem design, and it is this: today’s architectures still deal with the problem of I/O the old way, by pushing it down to the physical disk layer. The issue is that the disk layer is both the farthest point from the application generating the I/O demand and, simultaneously, the slowest component in the entire storage stack (yes, including flash).

In order to achieve any significant performance improvement from the application’s perspective, a large number of physical disks must be introduced into the system, either in the form of HDDs or SSDs. (An SSD is a good example of singular device parallelization because it represents a multiple of HDDs in a single package. SSDs are not without their own limitations, however. While SSDs do not suffer from mechanical latencies like HDDs do, they do suffer from a phenomenon known as write amplification.)

A NOT-SO-NEW APPROACH TO PARALLELIZATION

Another approach to dealing with the problem of I/O is to flip the problem on its head, in a manner of speaking. Rather than dealing with the I/O at the furthest point from the application and with the slowest components, as device parallelization attempts to do, let’s entertain the possibility of addressing the I/O as soon as it is encountered and with the fastest components in the stack. Specifically, let’s use the abundance of processors and RAM that exist in today’s modern server architectures to get the storage subsystem out of the way of the application. This is precisely what DataCore’s intention was in 1998, and with the emergence of multicore processors in 2005, the timing could not have been better.
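Here is a minimal sketch of that idea in Python: writes are acknowledged the instant they land in RAM (the fastest component, closest to the application), while a background thread destages them to the slow device off the critical path. This is a conceptual illustration under my own assumptions, not DataCore’s code.

    import threading, time

    ram_cache = {}      # the fast path: RAM, nanoseconds away
    dirty = set()       # blocks written but not yet destaged
    lock = threading.Lock()

    def write(block, data):
        with lock:
            ram_cache[block] = data
            dirty.add(block)
        return "ack"    # the application is unblocked immediately

    def destager(disk, stop):
        # The slow path: flush dirty blocks to the physical device in the
        # background, completely off the application's I/O path.
        while not stop.is_set():
            with lock:
                flush = [(b, ram_cache[b]) for b in dirty]
                dirty.clear()
            for block, data in flush:
                disk[block] = data   # milliseconds away, but nobody is waiting
            time.sleep(0.01)

    disk, stop = {}, threading.Event()
    worker = threading.Thread(target=destager, args=(disk, stop))
    worker.start()
    write(7, b"hello")               # returns immediately
    time.sleep(0.05); stop.set(); worker.join()
    assert disk[7] == b"hello"       # the data still reached persistent storage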

Let’s take a look at a depiction of what this looks like in theory:

[Figure: I/O parallelization in theory]

The contributory improvement in storage performance per device using the device parallelization technique simply cannot compare to that of the I/O parallelization technique. Simply put, the parallelization the industry is attempting to use to solve the storage I/O bottleneck is being applied at the wrong layer. I will prove this with a real-world comparison.

[Table: SPC-1 performance and price comparison]

In its latest showing of storage performance superiority, DataCore posted a world-record-obliterating 5.12 million SPC-1 IOps while simultaneously achieving one of the lowest $/IO figures ever seen ($0.10 per IO), beaten on the $/IO measurement only by another DataCore configuration. Comparatively, the DataCore IOps result was faster than the previous #1 and #2 test runs from Huawei and Hitachi, COMBINED! For a combined price of $4.37 million (the cost of the Huawei and Hitachi systems) and four racks of hardware (the size of both test configurations), you still can’t get the performance that DataCore achieved with only 14U of hardware (1/3rd of one rack) and a cost of $506,525.24.

[Figure: SPC-1 response time comparison]

Put another way, DataCore is nearly 1/9th the cost at 1/12th the size and delivered less than 1/3rd the response time of Huawei and Hitachi combined. If you try to explain this in terms of traditional storage or device parallelization techniques, you cannot get there. In fact, the only conclusion you can reach using that frame of reference is that it is impossible, and you would be correct. But it is not impossible when you understand the technique DataCore uses. This technique is referred to as I/O Parallelization.
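The cost and size ratios are easy to verify from the figures above (the response-time comparison comes from the chart). A quick check, assuming a standard 42U rack:

    datacore_cost = 506_525.24
    competitors_cost = 4_370_000            # Huawei + Hitachi combined
    print(competitors_cost / datacore_cost) # ~8.6, i.e. nearly 1/9th the cost

    datacore_size_u = 14
    competitors_size_u = 4 * 42             # four racks at 42U each (assumed)
    print(competitors_size_u / datacore_size_u)  # 12.0, i.e. 1/12th the size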

MORE THAN SIMPLY CACHE

Some have argued recently that it is simply the use of RAM as cache that allowed DataCore to achieve such massive performance numbers. Well, if that were true, then anyone should be able to reproduce DataCore’s numbers tomorrow, because it is not as if we have a RAM shortage in the industry. By the way, the amount of RAM cache in the Hitachi and Huawei systems combined was twice the amount DataCore used in its test run.

What allowed DataCore to achieve such impressive numbers is a convergence of several factors:

  • CPU power is abundant and continues to increase 20% annually
  • RAM is abundant, cheap, and doesn’t suffer from performance degradation like flash does
  • Inter-NUMA performance within SMP architectures has approached near-uniform shared-memory access speeds
  • DataCore exploits the capabilities of modern CPU and RAM architectures to dramatically improve storage performance
  • DataCore runs in a non-interrupt non-blocking state which is optimal for storage I/O processing
  • DataCore runs in a real-time micro-kernel providing the determinism necessary to match the urgent demands of processing storage I/O
  • DataCore deploys anti-queuing techniques in order to avoid queuing delay when processing storage I/O
  • DataCore combines all these factors across the multitude of processors in parallel
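The last three points can be illustrated with a small sketch: give each core its own worker and its own request queue, and have the workers poll instead of blocking on interrupts, so no request ever waits in one shared line. This is a conceptual toy of mine, not DataCore’s micro-kernel:

    import queue, threading

    NUM_CORES = 4
    queues = [queue.Queue() for _ in range(NUM_CORES)]  # one queue per core

    def worker(q):
        while True:
            try:
                io = q.get_nowait()   # poll; never sleep waiting on interrupts
            except queue.Empty:
                continue              # spin: stay hot, ready for the next I/O
            if io is None:
                return                # shutdown signal
            io()                      # service the request immediately

    threads = [threading.Thread(target=worker, args=(q,)) for q in queues]
    for t in threads:
        t.start()

    # Incoming I/Os are spread across the per-core queues (round-robin here),
    # avoiding the queuing delay of one shared request queue.
    for i in range(16):
        queues[i % NUM_CORES].put(lambda n=i: print(f"served I/O {n}"))
    for q in queues:
        q.put(None)
    for t in threads:
        t.join()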

CONCLUSION

So what does this mean? What does this mean for me and my applications?

First, it means that we now live in an era with parallel processing occurring at both the application layer and the storage layer. Second, it means that applications are now free to run at top performance because the storage system is out of the way. And finally, it means that the need to spend more and more money on ever-larger environments in order to achieve high performance is gone.

Applications are now unlocked and the technology is now within reach of everyone, let’s go do something amazing with it!

A Match Made in Silicon

I was reminiscing the other day about the old MS-DOS days. I remember being fascinated by the concept of using a RAM disk to make “stuff” run faster. Granted, I was only 10 years old, and while I didn’t understand the intricacies of how this was being accomplished at the time, I understood enough to know that when I put “stuff” into the RAM disk, it ran much faster than my 80MB Conner hard drive. If the RAM disk had been only slightly faster it wouldn’t have been that interesting, but it was amazingly faster.

By the mid-90s, many commercial applications, specifically databases, began treating RAM more and more like a disk rather than simply a high-speed working space for the application. Today there are many well-known in-memory database (IMDB) systems, most notably from Microsoft (SQL Server/Hekaton), Oracle (RDBMS), and SAP (HANA), to name a few.

In 1998, DataCore Software set out, among many other things, to use RAM as a general-purpose caching layer made accessible via software that could be installed on any x86 based system for any application. With the introduction of Intel multi-core processors in 2005, the software evolved even more to include exploitation of the additional processors, in parallel. Processors and RAM were getting faster and more abundant, which meant a much higher potential for tapping into the power of parallelism.

Now let’s fast forward to more recent times…

FORT LAUDERDALE, Fla., June 15, 2016 – Following a scorching run of world records, DataCore Software today rocketed past the old guard of high-performance storage systems to achieve a remarkable 5.1 million (5,120,098.98) SPC-1 IOPS™ on the industry’s most respected head-to-head comparison — the Storage Performance Council’s SPC-1™ benchmark. This new result places DataCore number one on the SPC-1 list of Top Ten by Performance. To put the accomplishment into perspective, the independently-audited SPC-1 Result for the DataCore™ Parallel Server software confirms the product as faster than the previous top two leaders combined.

The benefits of using RAM as cache cannot be denied. It worked very well in the beginning as RAM disks. It worked extremely well for IMDBs. Today, DataCore Software is the world-record holder for the fastest block storage system ever tested by the Storage Performance Council, not simply because of the use of RAM as cache, but more specifically because of the software mechanism used to turn the RAM into cache. If it was simply a matter of using RAM as cache, then any storage vendor should be able to reproduce what DataCore produced at the same or better price point on the SPC-1, tomorrow. I wouldn’t recommend holding your breath on that one.

In essence, what DataCore has done is create the world’s fastest in-memory “everything” storage engine (i.e. file data, object data, virtual machines, AND databases). Modern Intel x86-64 based architectures combined with the fastest RAM is truly a match made in silicon… a match only made possible and held together by the most efficient and most powerful storage software ever developed.

DataCore Introduces a New Breakthrough Random Write Accelerator for Update Intensive Databases, ERP, OLTP and RAID-5 Workloads

Introduction
It’s here! This week DataCore Software released an exciting new breakthrough feature extending the arsenal of enterprise features already present within SANsymphony-V. This new feature enhances the performance of random write workloads, which are among the most costly operations that can be performed against a storage system. The new Random Write Accelerator in effect takes highly random workloads and sequentializes them to achieve greater performance. The Random Write Accelerator has shown up to 30 times faster performance for random-write-heavy workloads that frequently update databases, ERP and OLTP systems. Even greater performance gains have been realized on RAID-5 protected datasets that spread data and reconstruction information to multiple locations across different disk drives. The new feature is now available and included within SANsymphony™-V10 PSP1.

Internal testing with the Random Write Accelerator feature and 100% random write workloads yielded significant performance improvements for spinning disks (>30x improvement) and even noteworthy improvements for SSDs (>3x improvement) under these conditions. The specific performance numbers will be covered later in this article.

The actual performance benefits will vary greatly depending on the percentage of random writes that make up the application’s I/O profile and the types of storage devices participating within the storage pool. Additionally, the feature is enabled on a per-virtual disk basis, allowing you to be very selective about when to apply the optimization.

Basis For Development
As applications drive storage system I/O, DataCore’s high-speed caching engine improves virtual disk read performance. The cache also improves write performance, but its flexibility is limited due to the need to destage data to persistent storage. In many environments the need to synchronize write I/O with back-end storage becomes the limiting factor to the performance that can be realized at the application level; hence the purpose of this development.

With some types of storage devices, there are significant performance limitations associated with non-sequential writes compared with sequential writes. These limitations occur due to:

  • Physical head movement across the surface of the rotating disk
  • RAID-5 reads to recalculate the parity data
  • Write amplification in SSDs

DataCore SANsymphony-V software presents an abstraction to the application — a virtual SCSI disk. The way that SANsymphony-V stores the data associated with these virtual disks is an implementation detail hidden from the application. Data may be placed invisibly across storage devices in different tiers to take advantage of their distinct price/performance/capacity characteristics. The data may also be mirrored between devices in separate locations to safeguard against equipment and site failures. The SANsymphony-V software can use different ways to store application data to mitigate the aforementioned limitations, while not changing the abstraction presented to the applications.

Function Details
The Random Write Accelerator changes the way SANsymphony-V stores data written to the virtual disks by:

  • Storing all writes sequentially
  • Coalescing writes to reduce the number of I/Os to back-end storage
  • Indexing the sequential structure to identify the latest data for any given logical block address
  • Directing reads to the latest data for a block using this index
  • Compacting data by copying it and removing blocks that have been rewritten
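Taken together, these steps describe a log-structured write path. Here is a toy sketch of the technique in Python, under my own assumptions (an in-memory log, one record per write); it illustrates the general approach, not SANsymphony-V’s implementation:

    class SequentialStore:
        def __init__(self):
            self.log = []      # append-only log of (lba, data) records
            self.index = {}    # logical block address -> newest log position

        def write(self, lba, data):
            # Every write lands at the tail of the log, so a random write
            # becomes a sequential one from the device's point of view.
            self.index[lba] = len(self.log)
            self.log.append((lba, data))

        def read(self, lba):
            # The index always points at the latest copy of a block.
            pos = self.index.get(lba)
            return self.log[pos][1] if pos is not None else None

        def compact(self):
            # Copy forward only the newest record for each block and rebuild
            # the index; records that were rewritten are dropped.
            new_log, new_index = [], {}
            for lba, pos in self.index.items():
                new_index[lba] = len(new_log)
                new_log.append(self.log[pos])
            self.log, self.index = new_log, new_index

    store = SequentialStore()
    store.write(900, b"a")    # random LBAs...
    store.write(17, b"b")
    store.write(900, b"c")    # ...including a rewrite of block 900
    assert store.read(900) == b"c"
    store.compact()           # reclaims the superseded record for block 900
    assert len(store.log) == 2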

Performance Details
Now for the part everyone has been waiting for: the performance numbers. There are three main states to consider from a performance perspective:

  • Base – the underlying level of performance that can be achieved with a 100% random write workload, without the Random Write Accelerator enabled.
  • Maximum – the performance that can be achieved with a 100% random write workload, with the Random Write Accelerator enabled but without compaction active.
  • Sustained – the performance that can be sustained with a 100% random write workload, with the Random Write Accelerator enabled and with compaction active.

The greatest performance is achieved during the Maximum state. When the virtual disk is idle, a background level of compaction will occur to prepare the system to absorb another burst of random write activity. That is, the background compaction will prepare the virtual disks to deliver performance associated with the Maximum state.

The following performance has been observed using IOmeter running a 100% write, 100% random workload with a 4K block size and 64 outstanding I/Os:


                                                Base IOPS   Maximum IOPS   Sustained IOPS
Linear 20 GB volume, SATA WDC 1 TB drive              327         19,500           11,000
Linear 20 GB volume, SSD 840 EVO 250 GB Pool       10,000         62,000           36,000
Mirrored 100 GB volume, PERC H-800 RAID-5 Pool        860         67,000           40,000

* DataCore cache enabled for before and after scenarios. IOmeter test: 100% write, 100% random workload with a 4K block size and 64 I/Os outstanding.

Interesting Observations
The above results highlight 3 key observations:

  • Significant acceleration (>30x improvement) of low-cost SATA disks for random write loads is possible. In fact in this particular test with DataCore, the resulting sustained performance of 11,000 IOPS actually exceeded that of a conventional Solid State Disk which ran at 10,000 IOPS.
  • The Solid State Disk also displayed improved performance going from 10,000 IOPS to 36,000 IOPS (>3x improvement).
  • Write intensive RAID-5 workloads displayed the greatest amount of improvement from 860 IOPS to 40,000 IOPS (>45x improvement).

Conclusion
DataCore’s Random Write Accelerator capability aims to address a limitation every storage system experiences to some extent. Random writes not only severely impact application performance within mechanical systems such as magnetic disks, they can also drastically reduce the performance and shorten the lifespan of SSD/flash based devices because of the write amplification effects produced from the write I/O pattern (see this publication for more detail). Check out this new feature along with many others in the now available SANsymphony™-V10 PSP1 release.

DataCore’s Answer to Random Write Workloads: Sequential Storage

Introduction
DataCore Software has developed another exciting new feature extending the arsenal of enterprise features already present within SANsymphony-V. This new feature serves to enhance the performance of random write workloads which are among the most costly operations that can be performed against a storage system. The new Sequential Storage feature will be available in SANsymphony™-V10 PSP1 scheduled for release within the next 30 days.

Internal testing with the Sequential Storage feature and 100% random write workloads yielded significant performance improvements for spinning disks (>30x improvement) and even noteworthy improvements for SSDs (>3x improvement) under these conditions. The specific performance numbers will be covered later in this article.

The actual performance benefits will vary greatly depending on the percentage of random writes that make up the application’s I/O profile and the types of storage devices participating within the storage pool. Additionally, the feature is enabled on a per-virtual disk basis, allowing you to be very selective about when to apply the optimization.

Basis For Development
As applications drive storage system I/O, DataCore’s high-speed caching engine improves virtual disk read performance. The cache also improves write performance, but its flexibility is limited due to the need to destage data to persistent storage. In many environments the need to synchronize write I/O with back-end storage becomes the limiting factor to the performance that can be realized at the application level; hence the purpose of this development.

With certain types of storage devices, there are significant performance limitations associated with non-sequential writes compared with sequential writes. These limitations occur due to:

  • Physical head movement across the surface of the rotating disk
  • RAID-5 reads to calculate parity data
  • Write amplification inherent to Flash and SSD devices

DataCore SANsymphony-V software presents an abstraction to the application — a virtual SCSI disk. The way that SANsymphony-V stores the data associated with these virtual disks is an implementation detail hidden from the application. Data may be placed invisibly across storage devices in different tiers to take advantage of their distinct price/performance/capacity characteristics. The data may also be mirrored between devices in separate locations to safeguard against equipment and site failures. The SANsymphony-V software can use different ways to store application data to mitigate the aforementioned limitations, while not changing the abstraction presented to the applications.

Functional Details
Sequential Storage changes the way SANsymphony-V stores data written to the virtual disks by:

  • Storing all writes sequentially
  • Coalescing writes to reduce the number of I/Os to back-end storage
  • Indexing the sequential structure to identify the latest data for any given logical block address
  • Directing reads to the latest data for a block using this index
  • Compacting data by copying it and removing blocks that have been rewritten

Performance Details
Now the part everyone is waiting for – the performance numbers. There are three main states to consider from a performance perspective:

  • Base – the underlying level of performance that can be achieved with a 100% random write workload, without Sequential Storage enabled.
  • Maximum – the performance that can be achieved with a 100% random write workload, with Sequential Storage enabled but without compaction active.
  • Sustained – the performance that can be sustained with a 100% random write workload, with Sequential Storage enabled and with compaction active.

The greatest performance is achieved during the Maximum state. When the virtual disk is idle, a background level of compaction will occur to prepare the system to absorb another burst of random write activity. That is, the background compaction will prepare the virtual disks to deliver performance associated with the Maximum state.

The following performance has been observed using IOmeter running a 100% write, 100% random workload with a 4K block size and 64 outstanding I/Os:

                                                Base IOPS   Maximum IOPS   Sustained IOPS
Linear 20 GB volume, SATA WDC 1 TB drive              327         19,500           11,000
Linear 20 GB volume, SSD 840 EVO 250 GB Pool       10,000         62,000           36,000
Mirrored 100 GB volume, PERC H-800 RAID-5 Pool        860         67,000           40,000

Interesting Observations
The above results highlight 3 key observations:

  • Significant acceleration (>30x improvement) of low-cost SATA disks for random write loads is possible. In fact in this particular test with DataCore, the resulting sustained performance of 11,000 IOPS actually exceeded that of a conventional Solid State Disk which ran at 10,000 IOPS.
  • The Solid State Disk also displayed improved performance going from 10,000 IOPS to 36,000 IOPS (>3x improvement).
  • Write intensive RAID-5 workloads displayed the greatest amount of improvement from 860 IOPS to 40,000 IOPS (>45x improvement).

Conclusion
DataCore’s Sequential Storage capability aims to address a limitation every storage system experiences to some extent. Random writes not only severely impact application performance within mechanical systems such as magnetic disks, they can also drastically reduce the performance and shorten the lifespan of SSD/flash based devices because of the write amplification effects produced from the write I/O pattern (see this publication for more detail). You can expect this feature along with many others in SANsymphony™-V10 PSP1 due out in November 2014.

Introduction to DataCore Virtual SAN

DataCore Virtual SANs

DataCore™ SANsymphony™-V virtual SAN is an alternative to external SANs: it pools storage devices directly attached to a group of host servers, yielding improved application performance and high availability within a highly consolidated server infrastructure.

A DataCore virtual SAN can leverage any combination of flash memory, solid state disks (SSD) and magnetic disks to provide persistent storage services as close to the application as possible without having to go out over the wire. Virtual disks provisioned from the virtual SAN can also be shared across a cluster of servers within the server group to support the dynamic migration and failover of applications between nodes.

Ideal Use Cases for DataCore Virtual SANs

Consider the DataCore virtual SAN solution for:

Latency-sensitive applications
Speed up response and throughput by leveraging flash memory as persistent storage close to the applications and caching reads and writes from even faster server DRAM memory.

Small Server Clusters in Remote Sites, Branch Offices and Small Computer Rooms
Put the internal storage capacity of your servers to work as a shared resource while protecting your data against server outages simply by adding SANsymphony-V software.

Virtual Desktop (VDI) Deployment
Run more virtual desktops on each server and scale them out across more servers without the complexity or expense of an elaborate external SAN.

Virtual SAN and Virtual SAN Nodes

A DataCore virtual SAN comprises two or more physical x86-64 servers with local storage, hosting applications. Up to 32 servers may be configured within a centrally-managed group. Each server participating in the virtual SAN must run a properly licensed instance of SANsymphony-V. These physical servers are also referred to as virtual SAN nodes (or simply nodes). Generally, each node contributes storage to the group’s shared pool. You may also configure SANsymphony-V nodes that do not contribute storage to the pool but host applications that access storage from the pool.

Virtual SAN Storage Devices
Uninitialized block storage devices attached to a node (with the exception of its boot drive) can be included in the virtual SAN. Any combination of flash memory, SSD and magnetic disks may be used. Removable USB devices are not supported since they cannot be relied upon to be present.

SANsymphony-V Deployment Options
There are three ways to configure the SANsymphony-V software on the application servers depending on the operating system or server hypervisor controlling the physical machine.

Physical Windows Server (no server hypervisor installed)
SANsymphony-V runs directly on top of the Windows Server operating system. All local block storage devices that are not initialized are automatically detected as suitable for the pool. An application such as Microsoft Exchange or SQL Server may be installed alongside SANsymphony-V. Windows Failover Clustering or other clustering technology can be used to provide application failover between servers.

Windows Server with Hyper-V
SANsymphony-V runs in the root partition (also referred to as the parent partition) on top of the Windows Server operating system. All local block storage devices that are not initialized are automatically detected as suitable for the pool. The Microsoft Hyper-V hypervisor role is installed alongside SANsymphony-V.

VMware ESXi and other non-Windows based hypervisors
SANsymphony-V runs within a dedicated Windows Server virtual machine. The administrator assigns uninitialized storage devices from the server hypervisor to the SANsymphony-V virtual machine as raw storage devices (RDMs in ESXi). The presentation of raw storage devices is preferred, but may not always be an option based on hypervisor and/or local RAID controller capabilities.

NOTE: All local disk-RDM mapping files being presented to SANsymphony-V must reside on the node’s local datastore (not within a virtual volume presented from SANsymphony-V).

Enterprise SAN Features
SANsymphony-V provides the following enterprise storage features:

  • Automated Storage Tiering
  • Advanced Site Recovery *
  • Analysis and Reporting
  • Asynchronous Replication *
  • Channel Load Balancing
  • Continuous Data Protection (CDP) *
  • Fibre Channel support *
  • High-Speed Caching
  • iSCSI support
  • NAS/SAN (Unified Storage)
  • Snapshot
  • Storage Migration and Pass-through Disks
  • Storage Pooling
  • Synchronous Mirroring
  • Thin Provisioning

(*) Optional Features

Virtual SAN Licensing
A SANsymphony-V license is required per node. Licenses are based on the amount of physical storage the node contributes to the shared pool. Some features are separately priced. Please refer to a DataCore authorized representative for more information regarding licensing and pricing.

Disk Pools
In SANsymphony-V, storage devices are organized into disk pools. You may create multiple disk pools within a virtual SAN node to distinguish how each resource pool will be used. For example, you might create a production disk pool and a test disk pool to separate the storage allocated to production from the devices best suited for testing.

Auto-tiering within Disk Pools
Members of a disk pool may differ in performance characteristics. SANsymphony-V uses sub-LUN automated storage tiering to dynamically match the best device to a given workload based on how frequently blocks of storage are accessed. This ensures that hotter data resides on faster disk and cooler data resides on slower disk within the pool.

Virtual Disks
Thin-provisioned virtual disks created from a node’s disk pool can be shared with other nodes in the virtual SAN. They appear as well-behaved logical drives to the operating systems or hypervisors that they are explicitly served to.

High-Speed Cache
Each virtual SAN node requires some amount of RAM to be used as high-speed cache. The amount of RAM allocated can be modified as necessary, but generally a minimum of 4GB or 10% of the host’s total available RAM (whichever is higher) is recommended for use as high-speed cache.
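As a quick illustration of that guideline, here is a tiny helper; the function name is mine, not a DataCore API:

    # Reserve the larger of 4GB or 10% of host RAM for high-speed cache.
    def recommended_cache_bytes(host_ram_bytes):
        four_gb = 4 * 1024**3
        return max(four_gb, host_ram_bytes // 10)

    # e.g., a host with 128GB of RAM -> ~12.8GB recommended as cache
    print(recommended_cache_bytes(128 * 1024**3) / 1024**3)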

The purpose of the high-speed cache is to serve as a speed-matching buffer for writes and a large cache for reads. The result is conservatively a 3-5x performance increase over the native performance of magnetic disks. The use of RAM as read-write cache provides a significant performance advantage over virtual SAN products that only use slower flash memory as a read cache device.

Synchronous Mirroring for High Availability
SANsymphony-V provides continuous access to the shared storage pools even when a virtual SAN node is out of service. Critical data is synchronously mirrored between pairs of virtual SAN nodes to achieve high-availability. RAID protection within each node provides additional safeguards against component-level failures.

How the virtual SAN Works

The diagram below shows an example of a virtual SAN. SANsymphony-V (in red) is running in a dedicated virtual machine (VM) on each node alongside VMs hosting applications.

[Figure: An example DataCore virtual SAN]
In the diagram above, the left two nodes are responsible for sharing a highly-available virtual disk with the other nodes that make up the group of servers. Each node pools its local flash and magnetic storage devices. Virtual disks are created from these pools and are synchronously mirrored between the two nodes. They are presented as multi-path disk devices over the network/fabric. The virtual disks may be sized as needed. Oversubscribing the storage is allowed since the virtual disks are thin provisioned to minimize actual capacity consumption.

The two left nodes, as well as any of the other nodes in the virtual SAN, can access the virtual disk over the network/fabric. Each node’s corresponding hypervisor recognizes the virtual disk as an available disk device.

In this same way, other nodes can contribute to the overall capacity of the shared storage pool. Each node adds more storage, processing, network and I/O power to the group.

Conclusion

The DataCore™ SANsymphony-V10 virtual SAN software can scale performance to more than 50 Million IOPS and to 32 Petabytes of capacity across a cluster of 32 servers, making it one of the most powerful and scalable systems in the marketplace. To help users evaluate the power of the new Virtual SAN capabilities and further educate themselves on the benefits of software-defined storage, DataCore is providing free access to a non-production use Virtual SAN software license. The free SANsymphony-V10 Virtual SAN software is now available for download at: www.datacore.com/Free-Virtual-SAN

Is Your Storage Highly Available, Or Simply Fault-Tolerant? – Part 2

Introduction
In Part 1 of this series we reviewed the principles related to high-availability and how high-availability and fault-tolerance differ. In Part 2 we will discuss ways high-availability is achieved and what other benefits can be realized from this type of architecture.

Abstraction: The Key To True High-Availability
If you recall, high-availability is the combination of component-level and data-level redundancy. Component-level redundancy is fairly commonplace in contemporary infrastructures. Everything from servers to storage offers component-level redundancy as an option, which is why I say this type of redundancy (i.e., fault-tolerance) should be the absolute minimum requirement: it is easy to achieve. So now the question is, “how is data-level redundancy (i.e., high-availability) achieved?”

As we previously discussed in Part 1, to attain the highest level of availability you would need to meet each of the six principles of high-availability. However, you can’t start down that road until you have a system in place that abstracts away the underlying storage hardware and simultaneously provides synchronous mirroring of data across that hardware. The abstraction principle is the baseline requirement for achieving not only synchronous mirroring, but all of the principles related to high-availability. You can read a great deal more about abstraction here.

I’m Abstracted, Now What?
Once data-hardware decoupling has occurred, we need a system that ensures the data synchronously coexists across the underlying storage hardware. Not surprisingly, the same system that abstracted the data away from the storage hardware should also provide the mirroring capabilities. It wouldn’t make much sense to go through all the pain of abstraction only to stop there, right? If you are familiar with enterprise storage systems, you are certainly by now saying to yourself, “All of this looks a lot like software-defined storage,” and you would be correct. One of the principles of software-defined storage is “Improve Data Service Availability”. You can read more about software-defined storage principles here.

The Mechanics of Synchronous Mirroring
Now we finally arrive at the “how” portion of this discussion. If you are like me, you are not satisfied with simply accepting that something works; you want to know how it works. By understanding how it works, you gain further appreciation for what is being accomplished, just as an art lover appreciates a Picasso or a Rembrandt.

Let’s take a look at what needs to happen to achieve this synchronization:

[Figure: The mechanics of synchronous mirroring]

A couple of things to point out here:

  • The high-speed RAM cache is vital to the process because this is the component that will allow receipt and acknowledgement of I/O to happen as quickly as possible on both storage virtualization engines.
  • The high-speed mirror path(s) should be able to use either Fibre Channel or iSCSI. Deploying iSCSI for the mirror paths should also allow both virtualization engines to be separated by significant distances (up to approximately 100km or so).
  • Since the data is synchronously mirrored to both nodes, the data should be fully accessible on both nodes simultaneously. This would eliminate any delays that would normally be associated with LUN trespassing or migration if a failure occurred on either node.
  • Besides adding data redundancy, performance is also greatly improved because of the additional channels, cache, and disks. Most systems today have the ability to load balance (or round-robin) their I/O requests against all available channels yielding better overall performance.
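To make the write path concrete, here is a conceptual sketch of a synchronous mirrored write, assuming two virtualization engines; the names and structure are my own illustration, not DataCore’s implementation:

    import concurrent.futures

    class Node:
        def __init__(self, name):
            self.name, self.cache = name, {}

    def write_to_node(node, block, data):
        node.cache[block] = data   # lands in the node's high-speed RAM cache
        return True                # node acknowledges once the cache holds it

    def mirrored_write(executor, nodes, block, data):
        # Send the write to both nodes in parallel over the mirror paths...
        futures = [executor.submit(write_to_node, n, block, data) for n in nodes]
        # ...and acknowledge the application only after BOTH nodes confirm.
        # That final condition is what makes the mirror synchronous.
        return all(f.result() for f in futures)

    nodes = [Node("engine-A"), Node("engine-B")]
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        assert mirrored_write(pool, nodes, block=42, data=b"payload")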

Let’s review how well we did in achieving the principles of high-availability:

✔   End-to-End Redundancy: achieved through component-level and data-level redundancy
✔   Subsystem Autonomy: no storage disk subsystem inter-dependencies; systems are not aware of each other
✔   Subsystem Separation: separation achieved through long-distance mirror paths
✔   Subsystem Asymmetry: made possible through hardware abstraction and administrator choice of hardware
✔   Subsystem Diversity: made possible through separation and administrator choice of facility
✔   Polylithic Design: made possible through hardware abstraction and administrator choice of hardware

Conclusion
If you have been reading my blogs of late, you will see a pattern emerging. Once again, it all boils down to abstraction. The need to break away from being tightly coupled to the hardware is readily apparent. So if the only way to achieve true high-availability is through abstraction, and the only way to achieve abstraction is with software (which should be obvious by now), then considering software-defined storage solutions makes a lot of sense. This is precisely what we are seeing in the market today. Until next time…