EMC Data Domain vs. CommVault Simpana Native Dedup

Disclosure: I was employed as an Enterprise Systems Engineer with CommVault at the time of writing this series.  The content of this series was not reviewed, approved or published by CommVault.

I normally don’t do competitive blogging, but this topic has come up a lot lately.  I’ve had a chance to visit with many customers in my area, some of whom use Simpana native deduplication and some of whom have offloaded that dedup to Data Domain.  I wanted to provide a comparison between the two approaches and share why I think most customers would be better off using Simpana Native Deduplication.

First, let me begin by comparing deduplication technologies.  Data Domain offers a robust appliance built on years of improvements, with a lot of strong deduplication capabilities.  CommVault is not new to this space either: Simpana 9 made a strong leap forward with CommVault’s integrated 3rd Generation Deduplication.

Round 1: Content Aware Fixed Block vs. Variable Block

The initial advantage Simpana has over Data Domain is visibility into the host filesystem.  Because Simpana deduplication is part of the greater “Common Technology Engine” that drives all Simpana data management modules, including data protection, Simpana agents are installed directly on the host.  This allows Simpana to be “Content-Aware” or, in other words, to easily understand the file, database structure or other object boundaries of the content being backed up.  This ensures that deduplication hashes (the fingerprints used to determine segment uniqueness) always begin at the start of an object, with no additional processing required to find these boundaries later in the stream.  As a result, other object metadata can be skipped over, which prevents it from interfering with the hashes and hindering the deduplication savings.
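
To make the boundary argument concrete, here is a minimal Python sketch. It is not Simpana’s implementation: the 8-byte block size, the function names and the `[name|t=...]` metadata headers are all assumptions chosen for illustration. It compares fingerprinting each object from its own first byte against fingerprinting one blind stream in which per-object metadata is mixed in with the data.

```python
import hashlib

BLOCK_SIZE = 8  # assumption: tiny block so the effect is visible; real block sizes are far larger


def fixed_blocks(data, size=BLOCK_SIZE):
    """Cut data into fixed-size blocks starting at offset 0."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprints(blocks):
    """Return the set of SHA-256 fingerprints for the given blocks."""
    return {hashlib.sha256(block).hexdigest() for block in blocks}


def content_aware(objects):
    """Fingerprint each object on its own, so blocks always start at the object boundary."""
    result = set()
    for payload in objects.values():
        result |= fingerprints(fixed_blocks(payload))
    return result


def stream_blind(objects, header_for):
    """Fingerprint one concatenated stream with per-object metadata interleaved."""
    stream = b"".join(header_for(name) + payload for name, payload in objects.items())
    return fingerprints(fixed_blocks(stream))


# Hypothetical data: the same two files backed up twice; only the metadata changes.
files = {"a.txt": b"AAAAAAAABBBBBBBB", "b.txt": b"CCCCCCCCDDDDDDDD"}
monday = lambda name: f"[{name}|t=1]".encode()
tuesday = lambda name: f"[{name}|t=1000]".encode()

# The content-aware path never hashes metadata, so both backups yield the same fingerprints.
aware_shared = content_aware(files) & content_aware(files)
blind_shared = stream_blind(files, monday) & stream_blind(files, tuesday)

print("content-aware blocks reused:", len(aware_shared))
print("stream-blind blocks reused:", len(blind_shared))
```

In this toy case the file contents are identical across both backups, yet the blind stream reuses fewer blocks because the changed metadata shifts and pollutes the fixed blocks that follow it.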

Content-Aware Deduplication:

Since Data Domain is a target appliance, it has no visibility into the file and object boundaries as they existed on the host.  Data Domain addresses this with an approach called “Variable-Block” or “Variable-Length Segment” deduplication.  Although Simpana and other backup solutions back up files and objects, the backup stream is processed into chunks and then written, along with metadata, to the backend disk storage or Data Domain device. Since Data Domain receives only chunks and metadata, it has no knowledge of file boundaries.  The appliance searches the backup stream with a rolling hash looking for patterns that may resemble content and flags “anchor-points” to partition segments along these boundaries.  While this approach does take additional processing power (hence the appliance approach), it does seem to be effective.
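
The rolling-hash idea can be sketched in a few lines of Python. This is a generic content-defined chunker, not Data Domain’s actual algorithm, which this post does not describe in detail. The window size, the mask, the polynomial hash and the name `cdc_chunks` are all assumptions made for illustration.

```python
import random

WINDOW = 16          # bytes covered by the rolling hash (assumed value)
MASK = 0x3F          # anchor when the low 6 bits are zero, roughly 1 position in 64
BASE = 257           # polynomial base for the rolling hash
MOD = (1 << 31) - 1  # modulus that keeps the hash small


def cdc_chunks(data, window=WINDOW, mask=MASK):
    """Split data at anchor points chosen by a rolling hash over the last `window` bytes."""
    chunks = []
    start = 0
    h = 0
    top = pow(BASE, window - 1, MOD)  # weight of the byte that leaves the window
    for i, byte in enumerate(data):
        if i >= window:
            # Slide the window: drop the oldest byte before adding the new one.
            h = (h - data[i - window] * top) % MOD
        h = (h * BASE + byte) % MOD
        # Flag an anchor point once the chunk holds at least `window` bytes.
        if i + 1 - start >= window and (h & mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes form the last chunk
    return chunks


# Hypothetical backup stream: 8 KB of pseudo-random bytes.
rng = random.Random(0)
stream = bytes(rng.getrandbits(8) for _ in range(8192))
chunks = cdc_chunks(stream)

print("chunks:", len(chunks))
print("reassembles cleanly:", b"".join(chunks) == stream)
```

Because each anchor decision depends on the bytes inside the window rather than on an absolute offset, the chunker needs no knowledge of file boundaries. The price is a hash update for every byte of the stream, which is the extra processing mentioned above.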

One additional benefit of a variable-length segment approach is its ability to adapt to content that is shifted within files.  The biggest offenders here are database backup dumps.  Many customers back up databases locally and then sweep those database dumps off to longer-term storage using their enterprise backup software. New data inserted into the database can shift other, unchanged content in the next database dump to a new alignment, which causes most of the dump file to hash as “unique” during the next backup sweep.  Data Domain’s variable-length deduplication can help detect and adapt to these shifts.  Simpana can also address this scenario in a few ways.
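
The shift problem is easy to reproduce with plain fixed-block hashing. The sketch below is only an illustration under assumed values: a 64-byte block size, a 4 KB pseudo-random “dump”, and a 7-byte insertion standing in for a new row.

```python
import hashlib
import random

BLOCK = 64  # assumed block size, kept small for the demo


def block_hashes(data, size=BLOCK):
    """Fingerprint fixed-size blocks that are aligned to the start of the file."""
    return {hashlib.sha256(data[i:i + size]).hexdigest() for i in range(0, len(data), size)}


# Hypothetical dump files: the second is the first with 7 bytes inserted at offset 100.
rng = random.Random(42)
dump_v1 = bytes(rng.getrandbits(8) for _ in range(4096))
dump_v2 = dump_v1[:100] + b"NEW ROW" + dump_v1[100:]

old = block_hashes(dump_v1)
new = block_hashes(dump_v2)
reused = len(old & new)

print("blocks in first dump:", len(old))
print("blocks reused by second dump:", reused)
```

Only the block ahead of the insertion still matches. Everything after it is the same data at a new alignment, so every later block hashes as new even though almost nothing actually changed.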

First, Simpana has a similar capability called Variable-Content Alignment.  This feature allows the backup stream to automatically shift the fixed block deduplication boundaries on the fly to compensate for any shifting of file content between backups.  As with variable-block deduplication, there is some additional processing overhead.  The other option, and the better approach in practice, would be to leverage Simpana database agents to perform streaming backups of the database itself using the same content-aware deduplication approach initially described.  This would enable all of the advanced database protection capabilities of Simpana while also avoiding the above dump file scenario entirely.

Round Results:

Personally, I think this round is a bit of a wash.  What Data Domain may initially lack in true content-awareness, it makes up for in brute force.  Both methods, however, are effective and generate strong deduplication savings.  The feedback I get from native dedup and Data Domain dedup customers alike points to similar deduplication ratios for the two approaches.  Compelling differences will need to be identified elsewhere in the architectures.

In my next post we will look at some of the fundamental differences between the two approaches.

2 comments to EMC Data Domain vs. CommVault Simpana Native Dedup

  • Danny2

    We have spent a truckload on CommVault with tape: current scenario.
    I would like to have an objective comment on now taking on Data Domain for backup, restore and archive via an automated process (disk based) versus upgrading the current scenario for 6000 users.

  • Hi Danny,

    Thanks for the question. First, if you are only using tape today, you should definitely be considering disk as a backup means. Actually, you should have been considering it years ago. 🙂

    I typically start with a 3 Tier backup approach that varies depending on backup, restore and retention requirements. Some tiers may not be required depending on your needs.

    Tier 1: Snapshots/Clones
    -Function: Leverages storage array-based capabilities to provide rapid backup and recovery of full datasets
    -Capabilities: Usually rapid backups and recoveries
    -Best Retention: A few days (typically full restores need the latest good data)
    -Cost: High as retention increases (it consumes primary storage disk space)

    Tier 2: Disk Backup Media
    -Function: Leverages low-cost disk storage or deduplication for most common restore operations.
    -Capabilities: Faster random-access recoveries, acceptable full restores.
    -Best Retention: A few days to a few months (Somewhere between 90 days and 6 months you should be looking at moving backups to tape)
    -Cost: More expensive for short term or long-term storage, but effective at moderate retentions. (The sweet spot for deduplicated disk is retentions over a few weeks but less than 6 months. This allows deduplication to increase storage efficiency without bearing the cost of multi-year archives on expensive disk media)

    Tier 3: Tape
    -Function: Lowest cost media for long-term archival/compliance storage
    -Capabilities: Slowest recoveries (although streamed full backup recoveries are rather fast)
    -Best Retention: 90 Days to Infinite
    -Cost: Higher initial investment for tape infrastructure, but very low long-term cost.

    With the above in mind, you should be looking at introducing deduplication into your environment for at least the 0-90 day retention requirements. If you don’t have long-term requirements (over 6 months), you could consider eliminating tape entirely. However, tape is still the best target for long-term retention, in my opinion.

    If you struggle with tape, it is probably because you are relying on it too much as a recovery tier. It is not a good recovery tier, but it is a damn good long-term archive tier. If you utilize disk and snapshots for restores, you can get by with fewer tape drives, less management and easier refreshes of tape media. Tape is very manageable when used properly.

    Data Domain, by itself, is not really a backup solution. It is a target device for storing backups. It relies on backup software for data transfer, retention, catalog and job management. So you will need something *like* CommVault even with Data Domain.

    However, Data Domain is the gold standard in hardware-based deduplication. I can’t think of any customer complaints other than cost, and they have been improving that with the latest platforms. Native CommVault deduplication will also do the job and may allow for more CommVault-specific capabilities (like synthetic fulls and client-based dedup). Also, it will most likely be less expensive, but you *MUST* follow the specifications when designing it or you will be disappointed with the outcome. You will probably also spend a bit more time on the day-to-day management and troubleshooting of CommVault dedup than you would with Data Domain.

    You may take a look at this post for some more insight on where I believe dedup technology is heading. Also, feel free to ask more questions!

    Virtual Tape is Dead, Dedup Appliances are Next
