In a recent blog post by Carol Sliwa, Dave Russell of Gartner gives his perspective on the current state of techniques for reducing primary storage. I have huge respect for Dave, and I think he does some fine work. In fact, Storwize is a Gartner client. But I must say that in this case, some clarification is in order. Allow me to elaborate:
It appears to me that Dave is thinking about how compression has been used historically, which is from the client perspective. This is old school, and it has been done for years. It is true that LZ compression has been around for 30+ years, as Dave notes. It has recently been re-applied by Microsoft in Office 2007: all of those .docx, .xlsx, and .pptx files are LZ-compressed before they are written out. Bravo! But that helps only Office 2007 application data, not primary storage overall. Here are Dave's quotes:
“Compression is really looking at a very specific amount of data. One everyday example might be a sound file. An MP3 is an example. Compression is just looking within that individual object, in this case, one single music file, and doesn't really persist any kind of data reduction across other types of files or data that it's going to process later on.”

“So, the opportunity on primary storage might be a little less, but it's still very significant, especially for so-called unstructured data or things like word processing documents, spreadsheets, PowerPoints, which tend to have not only a lot of commonality but a lot of situations where even one individual saves a file with very, very similar data multiple times.”
So what is the issue? Primary storage gets all kinds of file types thrown at it, and it is rarely redundant in real time. That means deduplication techniques may not provide anywhere near the big savings they do in archiving and backup. The trick for primary storage is to reduce capacity on the fly, without any disruption to the application. There must be no performance degradation, and when the application wants its data, it must be readily available.
Now, many have tried to use regular old LZ compression on primary storage in a common form such as WinZip, but it is inadequate. Why? Because primary storage traffics in active data, and active data gets modified a lot. When you modify a file that has been compressed with LZ, you need to decompress the entire file first, which takes lots of processing and may impact performance. Then, if the modification is too large, you need to store it somewhere else. This creates holes in your storage akin to Swiss cheese. Over time the holes grow, and all of the savings from compressing the data are lost. Heck, we have seen instances where the compressed data actually grows bigger than the original data itself! It's called fragmentation.
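To see why, here is a minimal sketch using Python's zlib (a standard LZ/DEFLATE implementation) of what whole-file compression forces you to do on every edit: inflate the entire stream, change one byte, and re-deflate everything. The resulting blob may be a different size, so it cannot simply be written back where the old one lived.

```python
import zlib

# Compress an entire "file" as one LZ (DEFLATE) stream.
original = b"A" * 4000 + b"B" * 4000
blob = zlib.compress(original)

# To change even a single byte, the whole stream must be inflated first:
data = bytearray(zlib.decompress(blob))
data[0] = ord("C")

# ...and then the entire file re-deflated. The new blob can come out a
# different size, so it cannot be overwritten in place -- this is where
# the "Swiss cheese" holes come from.
new_blob = zlib.compress(bytes(data))
print(len(blob), len(new_blob))
```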
I’m sure Dave knows that traditional LZ is not the right solution for primary data. But Storwize is! Why? Because all of our IP is devoted to making LZ real time and random access. ‘Real time’ makes compressed data transparent to the application, with no performance degradation. Dave recognizes the real-time aspect of this when he says: “…this tends to take a certain amount of processing power, particularly CPU, we're going to see more advancement in chip technology and that is going to be much more cost-affordable. The speed of being able to process data and potentially to do that more in an inline process rather than land all this on disk first is very likely to come about in numerous products.”

But simply throwing CPU at this suggests he is thinking about speeding up post-processing techniques. Throwing more horsepower at the problem is not the solution; you need to approach compression from a different angle. Storwize solved the ‘real time’ compression riddle by asking a different question. Instead of starting with a fixed-size input and creating a variable-size output, Storwize takes in variable input from the application and writes a file with fixed-size output, such that the Storwize appliance knows exactly where the compressed ‘chunks’ of data live inside the compressed file. Knowing this, there is no need to decompress the whole file in order to edit it, thereby reducing the I/O required to write data.
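As a rough illustration of the idea (the chunk and slot sizes and the on-disk layout here are my own assumptions for the sketch, not Storwize's actual format), each fixed logical chunk of application data can be compressed into its own fixed-size slot. Because chunk i always lives at offset i × SLOT, any chunk can be read or rewritten in place without touching its neighbors:

```python
import struct
import zlib

# Illustrative sizes only -- not Storwize's real parameters.
CHUNK = 4096   # fixed-size logical chunk of application data
SLOT = 2048    # fixed-size on-disk slot reserved per compressed chunk

def write_chunk(store: bytearray, i: int, chunk: bytes) -> None:
    """Compress one chunk into its fixed slot at offset i * SLOT.
    Since every slot is the same size, an update never relocates data."""
    c = zlib.compress(chunk)
    if len(c) + 2 > SLOT:
        raise ValueError("incompressible chunk; a real system would store it raw")
    record = struct.pack(">H", len(c)) + c          # 2-byte length prefix
    store[i * SLOT:(i + 1) * SLOT] = record.ljust(SLOT, b"\x00")

def read_chunk(store: bytearray, i: int) -> bytes:
    """Random access: inflate only slot i, never the whole file."""
    slot = store[i * SLOT:(i + 1) * SLOT]
    (n,) = struct.unpack(">H", bytes(slot[:2]))
    return zlib.decompress(bytes(slot[2:2 + n]))

# Build a two-chunk "file", then edit chunk 1 in place.
store = bytearray(2 * SLOT)
write_chunk(store, 0, b"A" * CHUNK)
write_chunk(store, 1, b"B" * CHUNK)
write_chunk(store, 1, b"C" * CHUNK)   # overwrite: only this slot changes
print(read_chunk(store, 1)[:4])       # b'CCCC'
```

The trade-off in this sketch is padding: each slot reserves worst-case space, which gives up some compression ratio in exchange for in-place updates and random access.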
Plus, there is an added bonus. Because of the way we store the compressed file, we take deduplication to a whole other level. While deduplication vendors are telling end users to decompress their files before deduplicating, because compressed files ruin deduplication ratios, Storwize doesn’t ruin deduplication; we actually make it more efficient. Dave also says that new innovations are coming that combine deduplication and compression, i.e., “Today one vendor may only offer deduplication; in the near future, they may offer compression on top of that; and where potentially today they only offer compression, they may expand into dedupe as well.” I don’t want to tell Dave anything, but most deduplication vendors already have compression – I know Avamar and Data Domain do.
Storwize compresses data in real time using random-access techniques, and you can still use your trusted deduplication backup solution. Quite simply, we have tomorrow’s technology, available today.