It’s time I get back to writing about more than just topics that surround what IBM is doing in the world of storage. For example, I am pretty passionate about data protection. Having worked for Veritas, ESG (covering data protection), Connected Corporation (sold to Iron Mountain), Avamar (sold to EMC) and EMC’s backup, recovery and archive group, I have spent a good chunk of my storage career in the data protection space.
The first thing that has always puzzled me is how different storage vendors define “data protection”. In big companies the definition of “data protection” is almost always political. The fact that backup, archive and replication are rarely in the same group is really only due to company politics. The reality is that any “copy” of data (we will come back to this) is, for the most part, all about protecting data. Companies like CommVault may have different product managers for capabilities such as backup or archive, but they don’t live in completely different organizations just because one group deals with hardware and another group only deals with software.
At the end of the day, protecting a company’s most valuable asset, its data, is about protecting the company, so all the protection capabilities should live in one group in order to drive the best synergy between them. Some vendors may say, “Well, that is not how customers buy the solutions.” In a number of situations that may be true, specifically when it comes to archive for compliance, but isn’t that another means of “protecting the business”? Additionally, by percentage, compliance archive is giving way to operational archive as a means of removing data from the backup stream and gaining back some of the backup window. These facts bring me to two points.
First, I was impressed last week when EMC came out and said that data protection is really an extension of data management. This is a reality. When I worked at EMC, we often stated that the data set that makes up data protection is 4x larger than the primary data set. Given the ever-increasing pressure IT is under, when IT starts analyzing the amount of data it is responsible for, how best to protect it and how to keep costs manageable, it needs to understand data throughout its lifecycle. It’s not a popular phrase, but Information Lifecycle Management (ILM) is about understanding data throughout its life, and that is really data management. Knowing the value of data at its inception is key to understanding how best to protect it.
Today that solution seems to be making copies of the data until the cows come home. Between clones, snaps, replicas, backup copies, archive copies and tape copies, keeping multiple copies of data is a way to ensure businesses do not lose their data. But does that mean businesses are meeting their SLAs? How does having 20+ copies of the data meet SLAs? The trouble comes when adding a new technology such as CDP to meet a new SLA. IT is so busy keeping the wheels on the bus that it forgets to update the old way of doing things. Before you know it, you have multiple protection solutions making copies of the data and eating up a lot of space. At one point early in my career at EMC, IT did a study and found that for every copy of an email anyone had, IT had more than 26 copies. This is how IT ends up managing more data than necessary, which costs a lot of money. Today data protection is about managing your copies of data.
This brings me to my second point. Copy Data Management (CDM) is a new category of data management that IDC has begun writing about and for which it has just done a market sizing (link). Actifio and Delphix are two companies in the CDM space. CDM is about managing the number of copies of data by taking advantage of a system that knows, through a set of policies, how many copies of a piece of data need to be kept and where they need to be kept. The estimates are that companies can save up to 90% of their true “protected” (backup, archive, replicated) data space by leveraging CDM services. This will become the next wave in data protection. CDM leverages capabilities such as deduplication, compression and single instancing, as well as copy or replication services, to keep the data footprint optimized.
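To make the policy idea concrete, here is a minimal sketch in Python of what “knowing how many copies need to be kept and where” could look like. It is not how Actifio, Delphix or any other product works; the `CopyPolicy` class, the `reconcile` function and the location names are all made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class CopyPolicy:
    """Hypothetical policy: number of copies to keep at each location."""
    required: dict  # location name -> number of copies to keep


def reconcile(policy, existing_locations):
    """Compare the copies that exist today against the policy.

    existing_locations is a list with one entry per copy, naming where
    that copy lives. Returns (excess, missing), each a dict mapping
    location -> number of copies over or under the policy.
    """
    have = Counter(existing_locations)
    excess, missing = {}, {}
    for location in set(have) | set(policy.required):
        diff = have[location] - policy.required.get(location, 0)
        if diff > 0:
            excess[location] = diff      # copies that could be reclaimed
        elif diff < 0:
            missing[location] = -diff    # copies the policy still needs
    return excess, missing


# Assumed example: one copy each at primary, DR and archive is the target.
policy = CopyPolicy({"primary_site": 1, "dr_site": 1, "archive": 1})
existing = ["primary_site"] * 5 + ["dr_site"] * 2 + ["tape"] * 3
excess, missing = reconcile(policy, existing)
```

In this made-up case, ten copies exist while the policy asks for three, so `excess` lists the seven copies that could be reclaimed and `missing` shows that the archive copy the policy calls for was never made.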
The real challenge, however, and no one seems to have tackled it, is how to solve the problem of data classification. It must be that today it is still cheaper to keep unnecessary data around than it is to properly manage it. However, in order for CDM to really work, a policy needs to be set for the data that is to be copied, such that the right number of copies are created and stored in the right location in order to achieve the proper SLA or RPO / RTO.
So what is the real future of data protection?
I have been singing this song for at least 5 years now. It is “Time Machine” for the enterprise. It is integrated CDM driven by policy, starting at the storage array. The real vision is to get rid of cumbersome backup software and integrate the key data copy services (see figure 1) directly into the array. Additionally, a policy engine that lives in the array asks a few questions when LUNs/filesystems are being created, allowing the administrator to set a “protection” policy for data that gets written to that LUN/filesystem. The moment data is written to the LUN/filesystem, a copy of the data is created and moved to the protected copy data repository. This is just like “Time Machine”, and it is the way CDP behaves today, but for all data. Additionally, the CDM device leverages single instancing, deduplication and compression capabilities to maintain an efficient disk footprint. A policy can also be set, just as with “Time Machine”, as to how many copies you want to keep and for how long (for archive purposes), as well as where they should be kept for high availability (replicated/copied to a secondary CDM device). These copy services should become a part of the array. No more managing complicated backup agents, archive agents or cumbersome replication software. It’s time to take the management out of secondary storage management and make it a part of primary storage management. Additionally, it is time to reduce the burden of costly software licenses and growing storage (disk and tape) for data protection. The solution is “Time Machine for the Enterprise”, but until then, the next step is CDM, Copy Data Management.
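For readers who think in code, the flow described above can be sketched roughly as follows in Python. This is a toy model, not a real array API: `ProtectionPolicy`, `CopyRepository` and `Volume` are hypothetical names, the policy only covers how many versions to keep (retention by age and replication to a secondary device are left out), and single instancing is approximated with a SHA-256 content hash.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ProtectionPolicy:
    """Hypothetical policy chosen when the LUN/filesystem is created."""
    max_versions: int  # how many point-in-time copies to keep


@dataclass
class CopyRepository:
    """Toy protected copy repository that stores identical content once."""
    chunks: dict = field(default_factory=dict)  # sha256 hex digest -> bytes

    def put(self, data):
        digest = hashlib.sha256(data).hexdigest()
        self.chunks.setdefault(digest, data)  # single instancing
        return digest


@dataclass
class Volume:
    name: str
    policy: ProtectionPolicy
    repo: CopyRepository
    versions: list = field(default_factory=list)  # digests, oldest first

    def write(self, data):
        # Every write immediately produces a protected copy.
        self.versions.append(self.repo.put(data))
        # Enforce the retention policy by dropping the oldest versions.
        # (Reclaiming unreferenced chunks from the repository is omitted.)
        overflow = len(self.versions) - self.policy.max_versions
        if overflow > 0:
            del self.versions[:overflow]


repo = CopyRepository()
vol = Volume("lun0", ProtectionPolicy(max_versions=2), repo)
for payload in (b"draft 1", b"draft 2", b"draft 1"):
    vol.write(payload)
```

After the three writes, the volume keeps only its two most recent versions, and the repository holds two distinct pieces of content even though three copies were made, because the repeated payload is stored once.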