Another great post from my colleague Mike Dutch

Many users believe that their backup tapes are their archive as well. Additionally, deduplicating storage systems are driving a similar notion that a backup platform and an archive platform could be one and the same. Opinions definitely vary on this topic, so I encourage all to comment. Let’s take a deeper look…

The reason you “back up” a set of data is because you might need to recover the primary data if it becomes unavailable or corrupted. If you want to access a data set as it existed at a particular point in time but can’t, you can replace the primary data with the backup copy. (SNIA defines backup as … “A collection of data stored on (usually removable) non-volatile storage media for purposes of recovery in case the original copy of data is lost or becomes inaccessible; also called a backup copy.”)

The reason you “archive” a data set is because you want to preserve it. It remains the primary data but because you'll rarely access it, you want to put it somewhere safe just in case you ever want or need to access it again. The SNIA Data Management Forum defines an archive as "a specialized repository (including the supporting processes, policies, hardware, and software) used to preserve information and data for the long-term." The capabilities of an archive "include the ability to preserve, protect, control, maintain authenticity and integrity, accommodate physical and logical migration, and guarantee access to information and data objects over their required retention period."

Regardless of whether archive should be used as a noun or a verb, the point is that the purpose, and therefore the lifecycle, of data in an archive repository differs from that of a backup copy. While few would disagree with this premise, I'd wager that most people believe this implies you must store and manage these copies separately. You can, but you don't have to if you're using a data protection solution that fully supports your business processes.

Someday, the notion of data protection will be subsumed by the notion of data storage. If we store data, why shouldn't we expect to get it back when we want it? Why shouldn't we expect to resume an application from whatever point in time we want to? If the system can’t do this, is it really protecting my data? This leads us to the question of what data protection is.

The SNIA definition of data as "The digital representation of anything in any form" obscures its richness (sight, sound, touch, smell, taste). After all, shouldn't analog information such as printed books be considered data? Of course, a dictionary is not an encyclopedia and a definition should be succinct. I'll read the SNIA definition as meaning, “Data is something that can be processed by a computer after any necessary format transformations.”

Let's posit that data protection means assurance that data is accessible to authorized users with acceptable performance in an auditable manner. Sounds reasonable, yet this definition exceeds the usual scope of data protection. Data protection is usually measured in terms of availability metrics, that is, in terms of RPO and RTO. We also want assurance that data has not been altered or destroyed in an unauthorized manner (data integrity). And of course, we don't want our data to be available to anyone who should not have access to it, whether "leaked" over a network or by losing control of physical storage media. Even if an authorized change was made, the user may change their mind and want to access an earlier version of the data. Also, what about poor performance? I know I'll find something else to do if application performance degrades to the point where I cannot remain productive. Unacceptable performance equates to unavailability. Auditable means the ability to verify who controlled what, and when (to comply with GRC initiatives and provide a chain of custody).

The traditional definitions of operational recovery and disaster recovery, distinguished by the impact of the outages (whether caused by operational errors, data corruption, or hardware failures), are subsumed by this accessible-performant-compliant definition of data protection. Retention and long-term preservation of fixed content (and related metadata) within an archive repository also fall under the "ensure data is accessible" umbrella of our broad definition of data protection. Regardless of whether performance and compliance capabilities are included in your definition of data protection, they remain requirements of conducting an effective and responsible business.

Let's get back to the main idea of this post, namely, that while it is necessary to MANAGE backup and archive data separately, it is not necessary to STORE backup and archive data separately. Storage systems with data deduplication capabilities are one proof point. An accessible-performant-compliant definition of data protection broadens the opportunities for both resource sharing and risk reduction. Data protection is much more than backup and archive. It's about keeping your business fit by ensuring its lifeblood, its data, is clean and flowing freely.
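
To make the "manage separately, store together" idea a little more concrete, here is a minimal Python sketch. It is only an illustration, not how any particular product works: the DedupStore, Catalog, and collect_garbage names, the fixed 4-byte chunk size, and the retention periods are all assumptions made up for this example. Two catalogs, one for backup and one for archive, each carry their own retention policy, yet both point at the same pool of deduplicated chunks.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class DedupStore:
    """Hypothetical content-addressed chunk pool shared by every catalog."""
    chunks: dict = field(default_factory=dict)  # digest -> chunk bytes

    def put(self, data: bytes, chunk_size: int = 4) -> list:
        """Store each unique fixed-size chunk once; return the digests
        (the "recipe") needed to rebuild the data."""
        recipe = []
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)
            recipe.append(digest)
        return recipe

    def get(self, recipe: list) -> bytes:
        """Rebuild data from its recipe."""
        return b"".join(self.chunks[d] for d in recipe)


@dataclass
class Catalog:
    """A separately managed view (backup or archive) over the shared pool."""
    name: str
    retention_days: int
    entries: dict = field(default_factory=dict)  # label -> (recipe, day stored)

    def add(self, store: DedupStore, label: str, data: bytes, day: int) -> None:
        self.entries[label] = (store.put(data), day)

    def expire(self, today: int) -> list:
        """Drop entries older than this catalog's own retention period."""
        expired = [label for label, (_, day) in self.entries.items()
                   if today - day > self.retention_days]
        for label in expired:
            del self.entries[label]
        return expired


def collect_garbage(store: DedupStore, catalogs: list) -> int:
    """Delete only chunks that no catalog references any more."""
    live = {d for c in catalogs for recipe, _ in c.entries.values() for d in recipe}
    dead = [d for d in store.chunks if d not in live]
    for d in dead:
        del store.chunks[d]
    return len(dead)


store = DedupStore()
backup = Catalog("backup", retention_days=30)      # short-lived recovery copies
archive = Catalog("archive", retention_days=3650)  # long-term preservation

payload = b"ABCDEFGHABCD"  # three 4-byte chunks, two of them identical
backup.add(store, "nightly", payload, day=0)
archive.add(store, "contract", payload, day=0)
chunks_stored = len(store.chunks)  # both copies share the same unique chunks

expired_backups = backup.expire(today=31)    # backup policy lets go of its copy
expired_archives = archive.expire(today=31)  # archive policy keeps its copy
chunks_freed = collect_garbage(store, [backup, archive])
restored = store.get(archive.entries["contract"][0])
```

In this sketch, expiring the backup entry frees nothing in the pool because the archive catalog still references the same chunks, and the archived data can still be rebuilt. The lifecycle decisions live in the catalogs, while the storage is common to both.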