Last week I took a course offered by EMC entitled ‘Lean Six Sigma’ - Yellow Belt. This is a training course that is used to help ‘solve problems’ in a given process, typically a work-related one. When I think about where the biggest problem in IT is, it's in the backup arena, so I thought: what better place to test it?

There are two components to Lean Six Sigma. Lean, or ‘leaning’ a process, is about removing excess from a process to make it more efficient. For backup, moving as much data out of the backup stream as possible would increase backup efficiency. Deleting unnecessary data or archiving static data in the production storage can cut as much as 50% of the data out of the backup, ‘leaning’ the process.

Next, when looking at Six Sigma, we learned about the DMAIC process. That is:

  • Define - Business case, scope, problem statement, goals
  • Measure - Process flow, run charts, Pareto charts
  • Analyze - Cause / Effect, waste identification
  • Improve - Waste removal, improve plan, control charts
  • Control - Monitor to prevent repeat failure, control charts, control plan

First, as I was thinking about this, I kept coming back to the Measure phase. If you don't currently measure your backup process, or only find out how it is doing when a recovery fails, then perhaps it's time to invest in a tool to help measure the current process. This measurement will allow you to identify current problems and will serve as a benchmark against which you can measure the success of your ‘leaning’. So, if we apply the steps in the DMAIC process to your typical backup environment, here is what it may look like.

Define

In the first step, Define, the objective is to describe the business case and problem statement and to set SMART (Specific, Measurable, Attainable, Realistic and Timely) goals. Again, the key will be in the measurements. With backup you typically want to measure recovery success, which most of the time is a result of backup success. Look at your existing recovery success rate, and hence your backup success rate, and decide what you would like the percentage of successful backups and recoveries to be. I would guess that in most cases 100% would be the requirement, but perhaps 99% is fine.

So the problem statement might be: data recoveries fail more than 54% of the time, and this data loss contributes to employee frustration and can translate into significant risk for the company during a legal disclosure process. These recoveries fail because of a flawed, multi-step process that needs to be examined and fixed in order to yield a 99% recovery success rate. This process specifically affects backup administrators on a daily basis. When the process is fixed and recoveries succeed 99% of the time, end users, executives and customers all benefit, and their satisfaction will keep the company's costs low and drive repeat business.

Additionally, in the Define phase it may be good to create an IPO diagram. IPO stands for Input, Process, Output.

Measure

In the Measure phase you will want to make sure you have metrics that can identify the following: recovery success rates, backup success rates, dollars lost due to failed data recovery, customer complaints due to failed data recovery, and the costs to back up and recover data for the environment. These are the key metrics for understanding and fixing the problems that are uncovered. It will be important to establish a baseline to improve upon, and this is where having tools in place (such as DPA) can help tremendously throughout this process. This is also a good place to ‘map out’ the current process flow and to identify what is in scope and what is not, in order to avoid ‘scope creep’. To review the process flow it may be useful to create a process flow chart using a whiteboard and Post-it notes. Brainstorm all of the steps and put them on the whiteboard in no particular order. Organize them in time sequence, then fill in the missing steps and review for completeness. This will be helpful for the Analyze phase.
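As a rough illustration of what such a baseline could look like, here is a minimal Python sketch that computes backup and recovery success rates and total cost from a list of job records. The `Job` record, its fields and the sample numbers are all hypothetical; a real reporting tool would supply its own data in its own format.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """One backup or recovery job (hypothetical record layout)."""
    kind: str        # "backup" or "recovery"
    succeeded: bool  # True if the job completed successfully
    cost: float      # assumed dollars spent running the job


def success_rate(jobs, kind):
    """Fraction of jobs of the given kind that succeeded, or None if there were none."""
    matching = [job for job in jobs if job.kind == kind]
    if not matching:
        return None
    return sum(job.succeeded for job in matching) / len(matching)


def baseline(jobs):
    """Collect the headline metrics to benchmark later improvements against."""
    return {
        "backup_success_rate": success_rate(jobs, "backup"),
        "recovery_success_rate": success_rate(jobs, "recovery"),
        "total_cost": sum(job.cost for job in jobs),
    }


# Made-up sample data, for illustration only.
sample_jobs = [
    Job("backup", True, 10.0),
    Job("backup", True, 10.0),
    Job("backup", False, 10.0),
    Job("backup", True, 10.0),
    Job("recovery", True, 25.0),
    Job("recovery", False, 25.0),
]

print(baseline(sample_jobs))
```

Running the same calculation on the same kind of data before and after any change is what makes the later comparison meaningful.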

Analyze

Next comes the Analyze phase. This is one of the best places to start to identify ‘waste’ in the process and see how the process can be ‘leaned’. (It is also, particularly for the backup process, a good place to see where the data can be ‘leaned’.) Identify waste and poorly performing areas of the process for both the backup and the recovery flow. Brainstorm where the process breaks down and which pieces may fail. It will be important to look at the overall daily trends as well as the weekly trends to see if there are any anomalies in the process. Typically backups are daily incrementals and weekly fulls, so you want to make sure that there are no flaws in either process in order to achieve 99% data recoverability. It may be useful to develop a ‘Cause & Effect Diagram’, such as the one shown below, to find all of the problems.
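The cause-and-effect diagram itself is a whiteboard exercise. As a complement, the Pareto charts listed under Measure can be approximated with a simple tally of failure causes, which shows which few causes account for most failures. The sketch below is only an illustration; the cause names and counts are invented, not taken from any real environment.

```python
from collections import Counter


def pareto(causes):
    """Return (cause, count, cumulative_share) rows, most frequent cause first."""
    counts = Counter(causes)
    total = sum(counts.values())
    rows = []
    running = 0
    for cause, count in counts.most_common():
        running += count
        rows.append((cause, count, running / total))
    return rows


# Hypothetical failure log: one entry per failed job.
failure_causes = (
    ["media error"] * 5
    + ["network timeout"] * 3
    + ["client offline"] * 2
)

for cause, count, cumulative in pareto(failure_causes):
    print(f"{cause:16} {count:3d} {cumulative:6.0%}")
```

In this made-up sample the top two causes cover 80% of the failures, which is the kind of result that tells you where to spend the Improve phase first.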

Improve

Now comes the Improve phase. Use the whiteboard and your Post-it notes again to work out a new process flow. Don't let the existing tools limit where your mind may go. Think outside the box. If part of the goal is to recover data 99% of the time ‘company wide’, and that includes remote offices, there may be a reason to use other tools at these offices in order to meet the objectives. Build out a ‘mistake proof’ process. Don't worry, you can analyze the costs afterwards, but identify the ‘best case scenario’. Be sure to document the new process. The next step is to implement some of your changes and see how your new process is working out. It will be very important to use the same tools and measure the new process against the old. You will want to use the same charts as before with the new data to show improvement. Based on your results, you may still want to refine the process a bit more once it is in action and new, unexpected issues pop up as a result.

Control

Finally, you will want to make sure you control the new process. It will be important to use the same tools to continually monitor and manage the process and to make sure you stay within the new specification of 99% recovery. If you ever fall outside of the range, you will need to review the process again, identify where it broke down, fix it and go through the whole DMAIC process again. There is no sense in doing all of the prior work and then not managing the process afterwards to make sure it stays in compliance.
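One common way to decide whether you have fallen "outside of the range" is a control chart. The sketch below uses a p-chart with three-sigma limits, which is my choice of technique for illustration rather than anything the course or a particular tool prescribes. The 99% target comes from the goal above; the sample size of 200 recoveries per day and the daily rates are assumptions.

```python
import math


def p_chart_limits(p_bar, n):
    """Three-sigma control limits for a proportion p_bar measured on samples of size n."""
    sigma = math.sqrt(p_bar * (1 - p_bar) / n)
    lower = max(0.0, p_bar - 3 * sigma)  # a proportion cannot go below 0
    upper = min(1.0, p_bar + 3 * sigma)  # or above 1
    return lower, upper


def out_of_control(daily_rates, p_bar, n):
    """Return (day, rate) pairs whose success rate falls outside the control limits."""
    lower, upper = p_chart_limits(p_bar, n)
    return [
        (day, rate)
        for day, rate in enumerate(daily_rates, start=1)
        if rate < lower or rate > upper
    ]


# Assumed inputs: 99% target success rate, 200 recoveries sampled per day.
target_rate = 0.99
recoveries_per_day = 200
daily_success_rates = [0.99, 0.985, 0.95, 1.0]  # made-up observations

print(p_chart_limits(target_rate, recoveries_per_day))
print(out_of_control(daily_success_rates, target_rate, recoveries_per_day))
```

With these assumed numbers only the third day is flagged. A day at 98.5% is within normal variation, so it would not by itself trigger another pass through DMAIC.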

Keep in mind that requirements may change, and outside factors such as data growth can, and will, have an impact on this process and may force you to revisit the process or the tools used to manage it. By continuing to measure and control the process, you will see when you start to fall outside of the critical success criteria and need to make adjustments, and you will also be able to operate at a much higher recovery level than you have in the past. Trust me: follow the Lean Six Sigma process. It will help you take backup beyond and put you on the "Road to Recovery".

Posted by Steve Kenniston

Tags:

Backup, Data Protection, Data Protection Management, Process, Recovery