Data Domain vs. Simpana Dedup: Round #2

Round 2: Where to Dedup?

In this post I’ll discuss some of the most important data protection architecture differences between Data Domain and Simpana.  The most defining difference between the two is where the deduplication occurs.  The diagram below summarizes the key differences, which I’ll cover in more detail in the rest of this post.


Dedup In a Box

Data Domain appliances can be easily attached to any backup infrastructure where existing backup policies can be rerouted to take advantage of deduplication.  This makes them a “quick-fix” for anyone looking to add deduplication to their architecture tomorrow.

How fast a deduplication solution can weed out duplicate data directly affects the effective throughput of the solution.  Although the backup streams feeding into the Data Domain may be 5TB/hr or more, the vast majority of this data is deduplicated, leaving perhaps 200GB/hr or so to be written to the large, slow SATA disks used for Data Domain storage.  The CPU processing power of Data Domain and other dedup appliances allows them to sustain far greater backup throughput than the disk drives alone would allow.  Beware, though: all that CPU power doesn’t help when reading data off disk for restores.  While dedup appliance vendors frequently boast about their impressive backup throughputs, very few release the same information for restore speeds.  This is for a reason.
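To put the numbers above in perspective, here is a minimal sketch of the arithmetic.  The 5TB/hr and 200GB/hr figures are the illustrative ones from this post, not measurements, and the function name is just something I made up for the example.

```python
def dedup_summary(ingest_gb_per_hr, written_gb_per_hr):
    """Return (reduction ratio, percent of ingest that never reaches disk)."""
    ratio = ingest_gb_per_hr / written_gb_per_hr
    reduction_pct = 100.0 * (1 - written_gb_per_hr / ingest_gb_per_hr)
    return ratio, reduction_pct


# Illustrative figures from the paragraph above: 5TB/hr in, ~200GB/hr written.
ratio, reduction_pct = dedup_summary(5000, 200)
print(f"{ratio:.0f}:1 reduction, {reduction_pct:.0f}% of ingest is never written")
```

In other words, the disks only have to keep up with a small fraction of the incoming stream during backups.  On a restore there is no such shortcut, because the data has to be read back from those same disks.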

Because deduplication only occurs after data has entered the Data Domain appliance, the client servers, MediaAgents and network all still have to contend with the skyrocketing growth rate of data.  This means additional MediaAgent hardware and 10GbE network infrastructure may be required so that the entire, non-deduplicated set of backup data can be transferred throughout the infrastructure.
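A rough back-of-the-envelope check shows why the network matters here.  This sketch uses theoretical Ethernet line rates and ignores protocol overhead, so real-world numbers will be lower.  The 5TB/hr and 0.2TB/hr streams are the illustrative figures from earlier in this post, and applying the 0.2TB/hr figure to traffic on the wire is my own assumption for comparison.

```python
import math


def link_tb_per_hr(gbit_per_s):
    """Theoretical line rate in decimal TB/hr, ignoring protocol overhead."""
    bytes_per_s = gbit_per_s * 1e9 / 8
    return bytes_per_s * 3600 / 1e12


def links_needed(stream_tb_per_hr, gbit_per_s):
    """Number of fully utilized links needed to carry the stream."""
    return math.ceil(stream_tb_per_hr / link_tb_per_hr(gbit_per_s))


for label, stream in [("non-deduplicated", 5.0), ("deduplicated", 0.2)]:
    print(f"{label}: {links_needed(stream, 1)} x 1GbE "
          f"or {links_needed(stream, 10)} x 10GbE")
```

A single 10GbE link tops out at about 4.5TB/hr on paper, so a 5TB/hr non-deduplicated stream already needs more than one, while the deduplicated stream fits comfortably on plain 1GbE.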

The approach also suffers a bit in flexibility.  If you run out of I/O throughput on the Data Domain, the solution is to buy another appliance.  If the I/O throughput is adequate, but you are beyond the appliance’s storage limits, you need to buy another appliance.  The appliance itself can also become a bit of an I/O bottleneck, and EMC seems to have acknowledged this with the introduction of Data Domain Boost.

Boost is available for NetBackup and NetWorker customers and is really targeted at reducing the I/O bottleneck of the appliance.  It does this by integrating a software library with the third-party backup software and offloading deduplication processing to the backup servers.  The documentation indicates this can result in an almost 2x throughput improvement.  This also helps reduce the network load between the backup servers and the Data Domain, but doesn’t help with the front-end network used by the clients to send data to the backup servers.

Dedup Everywhere

The major disadvantage Simpana Native Dedup has when compared with Data Domain is the inability to provide dedup capabilities for other backup vendors.  To gain the benefits of Simpana deduplication, you must “upgrade” your legacy backup software to Simpana.  However, with the robustness in the Simpana platform, there are far more reasons to upgrade than just deduplication alone.  Especially if you are considering an investment in Data Domain hardware, it might not be a bad idea to at least see what else you might be missing.

The greatest strength Simpana has over Data Domain with regard to deduplication is source-side dedup.  This approach carries with it a lot of nice benefits that really make the entire backup infrastructure more manageable.  In this approach, Content-Aware fixed-block deduplication occurs at the client server with very little increase in CPU usage (+15% on average).  This modest increase is more than offset in two ways.  First, backup windows with client-side deduplication are typically much shorter as there is much less infrastructure contention during backup times.  So, while the client is slightly more impacted, it is impacted for a much shorter duration.  Second, Simpana deduplication enables Deduplication Accelerated Synthetic Fulls, or DASH Fulls for short.  DASH Fulls are similar to traditional synthetic fulls in that the MediaAgent synthesizes new full backups from the last full and the subsequent incremental backups without client involvement.  However, DASH Fulls are entirely different in that they don’t require the backup data to be read and rewritten on the MediaAgent.  Instead, only metadata is read and updated to generate a new logical, synthetic image of a full, with no backup data movement required.  This means full backups that used to impact clients for hours can now be done in minutes on the MediaAgent, with absolutely no client impact!
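To make the two ideas above more concrete, here is a toy sketch of source-side fixed-block dedup and a metadata-only synthetic full.  This is not Simpana’s implementation or API: the MediaAgent class, the function names, the SHA-256 signatures and the tiny 4-byte block size are all assumptions chosen to keep the example readable.  It also ignores the content-aware part and deleted files.

```python
import hashlib

BLOCK_SIZE = 4  # assumption: tiny blocks for readability; real block sizes are far larger


class MediaAgent:
    """Hypothetical stand-in for a deduplicated store (not a real API)."""

    def __init__(self):
        self.blocks = {}         # signature -> block bytes
        self.manifests = {}      # backup name -> {file path: [signatures]}
        self.bytes_received = 0  # block data that actually crossed the network

    def has(self, sig):
        return sig in self.blocks

    def put(self, sig, block):
        self.blocks[sig] = block
        self.bytes_received += len(block)


def client_backup(agent, name, files):
    """Hash fixed-size blocks on the client and send only blocks the store lacks."""
    manifest = {}
    for path, data in files.items():
        sigs = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            sig = hashlib.sha256(block).hexdigest()
            if not agent.has(sig):
                agent.put(sig, block)
            sigs.append(sig)
        manifest[path] = sigs
    agent.manifests[name] = manifest


def synthetic_full(agent, name, full, incrementals):
    """Build a new full by merging manifests only; no block is read or rewritten."""
    merged = dict(agent.manifests[full])
    for inc in incrementals:  # later incrementals override earlier entries
        merged.update(agent.manifests[inc])
    agent.manifests[name] = merged


def restore(agent, name, path):
    """Reassemble a file from the blocks its manifest points to."""
    return b"".join(agent.blocks[s] for s in agent.manifests[name][path])


agent = MediaAgent()
client_backup(agent, "full-1", {"a.txt": b"AAAABBBBCCCC", "b.txt": b"AAAADDDD"})
client_backup(agent, "inc-1", {"a.txt": b"AAAABBBBEEEE"})
sent_before = agent.bytes_received
synthetic_full(agent, "full-2", "full-1", ["inc-1"])

print(agent.bytes_received)                # block bytes sent by the client
print(agent.bytes_received - sent_before)  # block bytes moved by the synthetic full
print(restore(agent, "full-2", "a.txt"))
```

The client backed up 32 bytes of file data in total, but only 20 bytes of unique blocks crossed the wire, and building the new full moved no block data at all because it only touched the manifests.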

Another benefit of the Simpana approach is performance and flexibility.  While Data Domain is just now getting a taste of the efficiency gained when moving dedup closer to the clients, Simpana is already fully reaping those benefits.  Using source-side deduplication, a moderately equipped MediaAgent with Dual Quad-Core CPUs and 32GB of RAM can typically maintain up to an impressive 3TB/hr backup throughput and store around 96TB of deduplicated data.  If your I/O requirements outpace your data growth, another MediaAgent server can be added.  Or, if your data growth outpaces your I/O needs, additional capacity can be added to that MediaAgent. You also have the capability of determining where deduplication occurs.  Deduplication processing can easily be transferred to MediaAgents for any clients that already have an especially high CPU usage.  This approach offers the flexibility to add and manage resources where they are needed instead of always requiring another appliance. 
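As a rough sizing sketch, the two per-MediaAgent figures above can be turned into a quick check of which resource you will run out of first.  Treating 3TB/hr and 96TB as hard limits that scale linearly is my simplification, and the workload numbers in the example are hypothetical.

```python
import math

# Per-MediaAgent figures quoted above; linear scaling is an assumption.
MA_THROUGHPUT_TB_PER_HR = 3.0
MA_CAPACITY_TB = 96.0


def sizing(backup_tb, window_hr, stored_tb):
    """Return (MediaAgents needed for throughput, 96TB capacity units needed)."""
    for_throughput = math.ceil(backup_tb / window_hr / MA_THROUGHPUT_TB_PER_HR)
    for_capacity = math.ceil(stored_tb / MA_CAPACITY_TB)
    return for_throughput, for_capacity


# Hypothetical workload: 40TB nightly in an 8-hour window, 250TB stored after dedup.
for_throughput, for_capacity = sizing(backup_tb=40, window_hr=8, stored_tb=250)
print(f"throughput calls for {for_throughput} MediaAgents, "
      f"capacity calls for {for_capacity} x 96TB")
```

Whichever number is larger is the constraint to address, either with another MediaAgent or with more storage behind the ones you already have.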

Additionally, since Simpana is an all-software solution, you can easily deploy the most cost-effective server and storage hardware available.  It is very easy to move from one hardware vendor to another for any reason.  I indicated above that restores from deduplicated disk are typically much, much slower than backups.  While every deduplication solution must depend on disk read performance to restore data, Simpana allows you to select the hardware performance needed to maintain SLAs.  Applications that require faster restore SLAs can be backed up to faster disks, or their backup data can be tiered to slower storage as it ages.

Round Results:

Under closer inspection, the benefits of source-side dedup really come into play.  Simpana source-side deduplication customers typically need fewer MediaAgents and far less network bandwidth than appliance dedup customers.  Source-side dedup also reduces the anomalies I call “stress-fractures” within your infrastructure.  Backup infrastructures running at max capacity typically seem to have a lot more backup errors and failures than the more moderately loaded ones.  Network errors, retries and timeouts greatly increase when you start pushing the limits of what your infrastructure is capable of.  Source-side dedup, by just doing data protection in a more intelligent fashion, alleviates this pressure throughout the backup path and really helps things run much more smoothly as a result.
