Oct 07, 2009

Comprehensive Capacity Optimization - Deduplication 2.0

Technology is great, isn't it? When someone thinks they have a new idea built on the same old technology foundation, they call it "X 2.0". For the past few weeks I have been watching analysts and vendors (specifically NTAP’s Dr. Dedupe and Permabit’s CEO Tom Cook) hash out the topic of Deduplication 2.0, and I have decided to jump in, because I believe the proverbial boat is being missed (since we are using water analogies). The real value of these conversations is their value to the end user. At the end of the day, it doesn't really matter who 'coined' or 'invented' a term like deduplication 2.0. What matters is whether the term actually helps describe a technology and how that technology can be leveraged to make things better in the data center. We should focus on the implications of this new generation of deduplication - ‘deduplication 2.0’.

In May I delivered a presentation to a number of EMC customers on the topic of Data Deduplication 2.0 - Comprehensive Capacity Optimization. The point of my presentation was simple (and keep in mind this was before the Data Domain acquisition): a number of capacity optimization technologies and capabilities are available to customers today. Originally these deduplication technologies were used primarily for backup, but deduplication is slowly making its way into primary storage. Deduplication in primary storage makes a lot of sense FOR DATA THAT IS STATIC. Why only static data? Static data is data that isn't accessed often (that doesn't mean it's not important); because access is infrequent, its performance requirements are lower than those of active data. Remember: nothing in IT is free. If I deduplicate data, I must ‘rehydrate’ it in order to use it, and that carries a performance cost, so I want to be careful where I deduplicate data so as not to inhibit performance on production data.
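To make the 'rehydrate' step concrete, here is a minimal sketch of single-instance storage in Python. It is a toy under stated assumptions, not how any vendor's product works: the `DedupeStore` class, its fixed-size chunking, and the tiny chunk size are all hypothetical choices made for illustration.

```python
import hashlib


class DedupeStore:
    """Toy single-instance store (hypothetical): each unique chunk is kept
    once, keyed by its SHA-256 digest."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}   # digest -> chunk bytes, stored once
        self.recipes = {}  # object name -> ordered list of digests

    def write(self, name, data):
        recipe = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # keep only the first copy
            recipe.append(digest)
        self.recipes[name] = recipe

    def rehydrate(self, name):
        # Reading costs one chunk lookup per recipe entry: this is the
        # extra work that deduplicated data pays on every access.
        return b"".join(self.chunks[d] for d in self.recipes[name])

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())


store = DedupeStore(chunk_size=4)
store.write("a.txt", b"AAAABBBBAAAA")
store.write("b.txt", b"BBBBAAAA")
logical_bytes = 12 + 8
physical_bytes = store.stored_bytes()
```

In this example 20 logical bytes occupy 8 physical bytes, but every read of `a.txt` has to walk its recipe and reassemble three chunks. That reassembly is cheap for data nobody touches and a real cost for data in constant use.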

Dr. Dedupe and Tom allude to Deduplication 2.0 moving beyond backup storage and into primary storage. While deduplication in primary storage is technically possible, customers need to understand two important points:

1) Performance: whatever I do to deduplicate (I prefer ‘optimize’) capacity in order to save space, I must ‘undo’ in order to use the data. If I set a policy that says any data that is 30 days old can be ‘optimized’, I need to be sure that data 30 days old is not active, or I could pay a substantial performance penalty when using it. Instead, I might set a policy of ‘any data that hasn’t been touched in 30 days can be optimized’. Even then, I would want to make sure that there is no scenario where, at the end of a quarter let’s say, I would need to rehydrate all of that data in order to run some report.

2) Comprehensive and cumulative deduplication throughout my storage tiers. What do I mean? If I compress and single-instance (deduplicate) data on my primary storage using one set of technologies, say single instancing and compression algorithms, and then I back up this data using sub-file deduplication, a separate set of algorithms, I am left with two separate silos of deduplicated data, and no one wins in this scenario.

It is important, no matter what deduplication technology you decide to use, that you can actually leverage the data stored in the deduplication device and that as data moves from device to device it doesn’t need to be rehydrated before it is moved.
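Here is a rough sketch of what moving data between devices without rehydration could look like. It assumes, purely for illustration, that both tiers use the same chunk fingerprints; the `Tier` class and the `move` function are hypothetical names, not part of any real product.

```python
import hashlib


def fingerprint(chunk: bytes) -> str:
    # Assumption: every tier agrees on this fingerprint function.
    return hashlib.sha256(chunk).hexdigest()


class Tier:
    """Hypothetical storage tier holding unique chunks plus per-object recipes."""

    def __init__(self, name):
        self.name = name
        self.chunks = {}   # fingerprint -> chunk bytes
        self.recipes = {}  # object name -> ordered list of fingerprints

    def ingest(self, obj, data, chunk_size=4):
        recipe = []
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            fp = fingerprint(chunk)
            self.chunks.setdefault(fp, chunk)
            recipe.append(fp)
        self.recipes[obj] = recipe


def move(obj, source, target):
    """Copy obj to target in deduplicated form; return chunk bytes sent."""
    recipe = source.recipes[obj]
    sent = 0
    for fp in dict.fromkeys(recipe):      # unique fingerprints, in order
        if fp not in target.chunks:       # send only what the target lacks
            target.chunks[fp] = source.chunks[fp]
            sent += len(source.chunks[fp])
    target.recipes[obj] = list(recipe)
    return sent


primary = Tier("primary")
backup = Tier("backup")
primary.ingest("report", b"AAAABBBBAAAACCCC")
backup.ingest("old", b"AAAAZZZZ")
sent = move("report", primary, backup)
restored = b"".join(backup.chunks[fp] for fp in backup.recipes["report"])
```

The 16-byte object crosses to the backup tier as 8 bytes of chunks, because one chunk is repeated and the target already holds it. If the two tiers used different algorithms, the source would have to rebuild all 16 bytes and the target would have to deduplicate them again from scratch.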

A great use case of capacity optimization in primary storage is how EMC evolved the Celerra product this year. Through a policy, let's say any data that is older than 30 days, data is compressed and stored as a single instance, with users seeing storage savings of as much as 30% to 50%.
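A policy like this can be sketched in a few lines. The example below is an illustration only and not Celerra's actual mechanism: the function name, the file names, and the choice to key on last access time (the safer variant from the performance point above) rather than creation age are all assumptions.

```python
from datetime import datetime, timedelta


def select_for_optimization(files, now, idle_days=30):
    """Return the names of files not accessed within the last idle_days.

    files maps a file name to its last-access datetime (hypothetical input;
    a real system would read this from file system metadata).
    """
    cutoff = now - timedelta(days=idle_days)
    return [name for name, last_access in files.items() if last_access < cutoff]


now = datetime(2009, 10, 7)
files = {
    "q3_report.xls": datetime(2009, 10, 1),     # touched last week: stays as-is
    "2008_archive.zip": datetime(2009, 1, 15),  # idle for months
    "notes.txt": datetime(2009, 9, 1),          # idle for just over 30 days
}
candidates = select_for_optimization(files, now)
```

Only the two idle files are selected. The quarter-end scenario is the thing to watch for: a report that suddenly reads every file in `candidates` would trigger rehydration of all of them at once.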

The real goal of Deduplication 2.0, and I think Dr. Dedupe alluded to this in his post "The Dedupe 2.0 Pundits Are Still Swimming in Lake 1.0", is that customers win when deduplication technology is part of the core system or file system and I no longer need to rehydrate data as I move it from primary storage to secondary storage. If each storage device in the 'stack' understands the language of the device ahead of it, and the deduplication or file system is coordinated and cumulative from device to device, then the customer is the winner. This pertains to primary storage, backup storage and archive storage. Never having to rehydrate data means more efficiency and a reduced tax on devices, which can save the end user money.

Tom Cook, CEO of Permabit, points out in his blog post "Dedupe 1.0 vs. Dedupe 2.0: The debate ensues" that the only value of deduplication for primary storage is in moving your data to a deduplicated archive, which allows you to store data efficiently for the long term. I agree with that in principle but, as we have seen, it is not that practical. Why? Because at the end of the day, the costs to manage storage are going up, up, up and the costs to buy storage are going down, down, down. End users (NOT IT) are generally lazy or, should I say, just too busy to manage this storage. In order to properly archive data, you need to have a policy that tells you what to move and when to move it. IT can make all the recommendations in the world about the value of archive, but if users, or really line-of-business managers, don't tell IT what data is important and what can be archived, then IT doesn't really have a choice, which makes the premise of moving data to an archive, deduplicated or not, moot.

The real issue is balancing capacity optimization (the granularity at which you deduplicate data) against performance on the appropriate tier, given that deduplication will happen on all tiers of storage. The higher the performance requirements (tier 1), the less 'optimized' I make the data; the lower the performance requirements (tier x, archive), the more optimized I make the data. The benefits to the customer are that A) data is optimized consistently on each device, and B) the optimization is cumulative from device to device, removing silos of deduplicated data across the stack.
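The granularity trade-off can be shown with a small experiment. This is a simplified sketch that assumes fixed-size chunks and counts one lookup per chunk on a full read; the `dedupe_stats` function and the sample data are made up for illustration, and real systems have many more factors in play.

```python
import hashlib


def dedupe_stats(data: bytes, chunk_size: int):
    """Return (unique bytes stored, chunk lookups needed to read everything)."""
    unique = {}  # chunk digest -> chunk length
    lookups = 0
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        unique.setdefault(hashlib.sha256(chunk).digest(), len(chunk))
        lookups += 1
    return sum(unique.values()), lookups


data = b"ABCDABCDABCDWXYZ" * 4  # 64 bytes with repetition at several scales

fine = dedupe_stats(data, 4)     # archive-style: most optimized
medium = dedupe_stats(data, 16)  # middle tier
coarse = dedupe_stats(data, 64)  # tier-1-style: effectively not optimized
```

With 4-byte chunks the 64 bytes shrink to 8 but a full read needs 16 lookups; with 16-byte chunks they shrink to 16 with 4 lookups; with a single 64-byte chunk nothing is saved and the read is one lookup. That is the balance in miniature: finer granularity saves more space and costs more work on every read.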

For more on tiered dedupe, read my Betamax Redux blog post on EMC's vision for deduplication and hopefully this will put you on a high performance ‘Road to Recovery’.

Tags:

Archive, Avamar, Backup, Data Deduplication, Data Protection, Deduplication, Disk Library, EMC, protection, Replication