Oil & Water?

Last week Mike Davis from Ocarina Networks published a blog post, "Compression and Dedupe like Oil & Water?" It was a good piece. From what I understand (and I don't know Mike), he will be taking over blogging duties as Sunshine has moved on to greener pastures, and I wish her the best. The reason for this piece is that Mike made some interesting statements and I had some questions. I know the guys at Wikibon have ideas on this topic, and I tried asking my questions via Twitter and then on his blog, but I haven't received any feedback (trust me, I am not naive, I know we are all very busy), so I thought it would be interesting to share my thoughts and try to start some dialog.

Mike stated:

"If you apply a compression-only workflow to a dataset let’s say you get 50%. Now run the same data set through a dedupe-only workflow and you’ll get maybe 20% (remember this is primary storage not backup data). Now take those little chunks and pointers from the dedupe workflow and compress them; you might get an additional 35% for a total of 55%. So compression of deduped data is less effective than on the raw data-set, but the combination (for this example) has eeked out a 5% advantage over the compression-only workflow."

I understand Mike to be saying that if you used deduplication and compression together, you could potentially get an additional 5% optimization of your storage over standard compression. My question is: at what cost? I don't necessarily mean dollar cost, although that is a factor, but the cost to the end user and the IT administrator. When I think of capacity optimization for primary storage, here is what I believe the requirements are for IT:

  1. Optimization cannot cause any impact to the performance of the storage array
  2. Optimization cannot cause any change in downstream processes for the systems administrator
  3. Optimization cannot cause any increase in storage management functions
  4. The solution needs to be heterogeneous (I just remembered this one)

If the optimization technology cannot ensure that these key storage functions are maintained, then quite frankly, the solution is not a solution for primary storage.

Let's think about this from IT's perspective. First, if I implement a solution that can't optimize data in real time, then the optimization must be done post-process, after the data is stored. If that is the case, I need to find time on the array when the workload is low enough to allow the solution to perform the necessary I/O, which adds load to the system. Given that storage systems are busy with users for more than eight hours per day, and that there also needs to be time to take snapshots and time to back up the system, where is the time to perform the optimization of the storage?

Second, once the storage is optimized, is it readable? In other words, if a user needs some of the optimized data, can the application that wrote the data get at it? If it is deduplicated and compressed, it cannot be accessed. (In fact, today the only deduplication technology that allows a file to be viewed, read-only, is Avamar.) This means that if the data is required for additional processing, IT must rehydrate the data in order for it to be used. This creates additional process work for the IT administrator and consumes disk capacity for the rehydrated data, so you're not really saving the space.

Finally, any solution that interferes with IT business processes, such as backup, is again not a good solution. IT has spent an inordinate amount of time and money on its backup best practices. If an optimization technology that sits in front of the backup process significantly alters that process, its ROI would have to be unbelievably compelling to justify such a dramatic shift; it is almost unheard of.

If a capacity optimization technology such as deduplication is implemented on primary storage, it changes the files on the file system such that when IT goes to back up the newly deduplicated file, it is considered a brand new file. A file that would not have needed to be backed up (incremental backups don't back up unchanged files) now has to be backed up. Yes, this new file/blob takes up less disk capacity, but is IT going to go back and remove the non-deduplicated file from tape? So in reality you're storing more. Additionally, how do I track this file/blob in my backup system? Is it indexed as a file? If not, how do I recover it in the event of a disaster?

Most importantly, how does a deduplication technology such as Avamar or Data Domain back up this new deduplicated file? The backup vendors with deduplication technology tell folks who encrypt or compress (and now deduplicate) their primary capacity to decrypt, decompress, or rehydrate their data before running deduplication, because if any of these characteristics are present it will ruin the deduplication ratio. How is this good for primary storage if I have to rehydrate to do backups? How is this good for the backup environment?
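To illustrate why the rewritten file gets picked up again, here is a minimal sketch of the modification-time rule that many file-level incremental backups use. This is my own simplification for illustration, not any particular backup product's logic:

    import os

    def needs_incremental_backup(path: str, last_backup_time: float) -> bool:
        """Simplified incremental rule: copy anything modified since the last run."""
        return os.stat(path).st_mtime > last_backup_time

    # A post-process optimizer that rewrites a file in place (deduplicating or
    # compressing it) changes its on-disk representation and modification time,
    # so the next incremental treats it as a changed file even though the user
    # never touched it.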

The new finally. Folks have asked about other solutions such as NetApp's compression, EMC's compression, or ZFS. Again, these are all good solutions that would be a good fit for certain use cases, but the problem with each of them is vendor lock-in; they are not heterogeneous. In order to maintain flexibility in IT, it is important to purchase heterogeneous solutions.

Let's also think about what the 5% actually means to the array. If I deduplicate 10 TB of capacity by 20%, I am left with 8 TB. The additional 35% of compression applies to the 8 TB, not the 10 TB, so I still have 5.2 TB of capacity. Standard LZ compression, again depending on data type, should yield 50% compression at a minimum, giving you 5 TB of capacity. I think in this case compression would be a better solution.
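To make the arithmetic concrete, here is the same back-of-the-envelope calculation written out. The 20%, 35%, and 50% figures are the example numbers from above, not measurements:

    # Back-of-the-envelope capacity math using the example percentages above.
    raw_tb = 10.0

    # Dedupe-then-compress workflow: 20% dedupe, then 35% compression on what remains.
    after_dedupe = raw_tb * (1 - 0.20)       # 8.0 TB
    after_both = after_dedupe * (1 - 0.35)   # 5.2 TB

    # Compression-only workflow: 50% compression on the raw data.
    compression_only = raw_tb * (1 - 0.50)   # 5.0 TB

    print(f"Dedupe + compress: {after_both:.1f} TB left "
          f"({(1 - after_both / raw_tb):.0%} total reduction)")
    print(f"Compression only:  {compression_only:.1f} TB left "
          f"({(1 - compression_only / raw_tb):.0%} total reduction)")

With these example numbers, the combined workflow leaves 5.2 TB (a 48% total reduction) versus 5 TB (50%) for compression alone.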


Compression and Deduplication - Milk & Cookies

The reality, as with every answer in IT, is that for every use case there is an 'it depends' answer. Compression and deduplication actually can co-exist. They can even co-exist on different tiers of storage if done properly. If compression is done right, in real time and with no impact to primary storage from an application integration, performance, or downstream process perspective, then compression on primary storage is the right answer. I would also say that if you could deduplicate data in the same manner, it would be a viable solution as well, but unlike compression, there are no deduplication solutions that can achieve these characteristics.

Now deduplication on 'primary' (and primary is in quotes again; it's an 'it depends' situation) can be done if the primary storage is an archive where real-time data access is not required and no secondary operation such as backup is going to occur. But if the 'primary' storage is active storage, then the only way to do any capacity optimization, given the tools that are available today, is real-time compression. Additionally, in order to maximize the solution, IT would want real-time compression that uses random access techniques as well. If random access compression is used, then downstream processes such as backup and deduplication are enhanced, not degraded.
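To illustrate what I mean by random access compression, here is a minimal sketch of the idea, my own simplification rather than any vendor's implementation: data is compressed in fixed-size chunks that can be decompressed independently, so reading a small region doesn't require decompressing (or later recompressing) the whole file.

    import zlib

    CHUNK_SIZE = 64 * 1024  # fixed-size, independently decompressible chunks

    def compress_chunks(data: bytes) -> list:
        """Compress data as a list of independent chunks (illustration only)."""
        return [zlib.compress(data[i:i + CHUNK_SIZE])
                for i in range(0, len(data), CHUNK_SIZE)]

    def read_range(chunks: list, offset: int, length: int) -> bytes:
        """Serve an arbitrary byte range by decompressing only the chunks it touches."""
        first = offset // CHUNK_SIZE
        last = (offset + length - 1) // CHUNK_SIZE
        piece = b"".join(zlib.decompress(chunks[i]) for i in range(first, last + 1))
        start = offset - first * CHUNK_SIZE
        return piece[start:start + length]

    data = bytes(range(256)) * 4096                     # ~1 MB of sample data
    chunks = compress_chunks(data)
    assert read_range(chunks, 200_000, 100) == data[200_000:200_100]

Because each chunk stands on its own, a read or update touches only the chunks involved instead of the whole file.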

Today Storwize is the only solution that provides IT with real-time primary storage compression that is transparent to the application that wrote the data (meaning the application can read the compressed data), is done using random access techniques, and doesn't require the data to be decompressed before it is backed up.

Now when the data is backed up, if you use deduplication, the Storwize technology can enhance it, bringing a 10x optimization up to a 14x optimization and saving IT money not only on CapEx but on OpEx as well. This is why I think compression and data deduplication are more like milk and cookies than oil and water, but I encourage your thoughts.
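To put the 10x-versus-14x figure in perspective, here is a quick calculation; the 100 TB of protected data is just an illustrative number, not a benchmark:

    protected_tb = 100.0                  # illustrative amount of data sent to backup

    stored_at_10x = protected_tb / 10     # 10.0 TB of backend capacity
    stored_at_14x = protected_tb / 14     # ~7.1 TB of backend capacity

    saved_tb = stored_at_10x - stored_at_14x
    print(f"Backend capacity saved: {saved_tb:.1f} TB "
          f"({saved_tb / stored_at_10x:.0%} less disk behind the backup target)")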

Tags:

Backup, Capacity Optimization, Compression, data compression, Ocarina Networks, Ocarina, Process, real-time compression, Storage, storage compression, Storwize