In my day job I am constantly battling with prospects' need to store large amounts of data for long periods of time. With growing compliance needs (Sarbanes-Oxley springs to mind) and a general perception that storage is almost free (thank you, 1TB+ SATA drives), long-term archiving is becoming much more common. Regardless of why data needs to be archived, keeping it safe and accessible can be very tricky. I have compiled a list below of some of the considerations any organisation should take into account when designing its data archiving policies:
Storage Medium
Firstly you need to think about your storage medium (i.e. what the data will be stored on). Conventional wisdom says “write it to tape”, but it’s important to understand how long each storage medium will last before committing. Typically you will have the following choices:
- Tape
- Disk (in a storage array)
- Disk (portable USB)
- CD
- DVD
Each of the media above has a finite shelf life. Tapes demagnetise over time (my crates of old ZX Spectrum games are a great example), and individual hard drives can fail without warning and cannot be trusted after two years or less. Keeping data on spinning disks in a larger array, such as an EMC Data Domain, will incur a higher cost.
Choose What to Archive
Long-term data archiving can be an expensive operational cost, so it is important to keep data sets to a minimum. A hosted application may run across a number of operating systems, so identify where the important data resides and farm it off to the long-term archive. Non-critical data will still need to be backed up to help with Disaster Recovery, but it only needs to be kept for a week or two.
Storage Format
Locking a tape away for up to 10 years is perfectly feasible, but when you need to get data off it, can you guarantee you will have a device that can read it? To put it another way, if you found a MiniDisc or Zip disk in a cupboard, what would you have to put it in?
It may seem crazy to think of CDs or DVDs as historical formats, but it is very difficult to predict what will stand the test of time. Picking mainstream formats is one approach, but also be mindful of what your vendors' product timelines look like.
Data Format
Have you ever been in the situation where you cannot open a file? No matter what program you use it’s just not recognised, and it takes a good few Google searches to find the one bizarre application that can read it. When storing data for long periods of time, choose common file formats, and if there is any doubt then archive the relevant software as well.
Revisiting Old Archives
Archives should be revisited regularly (once every 12-24 months) and their data integrity checked. During these checks data can be rotated on to new media. A process of monitoring storage and data formats can also be kicked off, to evaluate whether they are approaching end of life.
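One way to make the integrity check repeatable is to record a checksum for every file when the archive is written, then recompute and compare at each review. The sketch below assumes SHA-256 and an archive that can be mounted as a directory tree; the function names (sha256_of, build_manifest, verify_manifest) are illustrative, not taken from any particular backup product.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archive files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose digest changed."""
    problems = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(rel_path)
    return problems
```

The manifest would be built once at archive time and kept somewhere other than the media it describes. An empty list from verify_manifest at review time suggests the copy is still intact; anything else is a cue to restore from the second copy.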
Data Redundancy
Critical data should never reside on a single piece of media, hosted by a single company. Long-term archives should be made up of two or more copies of the same data sets to help mitigate the risks of floods, fires, and (dare I say) riots. At the same time, copies of critical data should not all be held by the same vendor. In these torrid financial times, what if the company you choose goes under halfway through the archiving period? It is easy to agree in a contract what will happen, but if the organisation you have contracted with no longer exists, what then?
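The rules above are simple enough to check mechanically against an inventory of where each copy lives. This is a minimal sketch; the ArchiveCopy record, its fields, and the site and vendor names in the test data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchiveCopy:
    media_id: str  # label on the tape or drive (hypothetical field)
    site: str      # physical location of this copy
    vendor: str    # company holding this copy


def redundancy_gaps(copies: list[ArchiveCopy]) -> list[str]:
    """List which of the redundancy rules a data set's copies fail to meet."""
    gaps = []
    if len(copies) < 2:
        gaps.append("fewer than two copies exist")
    if len({c.site for c in copies}) < 2:
        gaps.append("copies do not span two sites")
    if len({c.vendor for c in copies}) < 2:
        gaps.append("copies do not span two vendors")
    return gaps
```

An empty result means the data set has at least two copies spread across more than one site and more than one vendor.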
Organising Data
As an archive grows to multiple media sets, and potentially many TBs of data, knowing which tape/drive/file has the data you want becomes more critical. In some situations you may be required to have locked bins shipped back to your data centre, and a mistake over which bin to call back could be costly. Ensure you are diligent with your documentation, and label tapes properly to avoid confusion.
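Even a very small catalogue makes it easier to call back the right bin. The sketch below assumes each media set is recorded with the data set it holds, the date range it covers, its label, and the bin it was shipped in; the CatalogueEntry fields and the locate function are hypothetical names for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CatalogueEntry:
    dataset: str      # e.g. "payroll"
    first_day: date   # earliest data held on this media
    last_day: date    # latest data held on this media
    media_label: str  # label written on the tape or drive
    bin_id: str       # locked bin the media was shipped in


def locate(catalogue: list[CatalogueEntry], dataset: str, wanted: date) -> list[CatalogueEntry]:
    """Return every entry holding the given data set for the wanted day."""
    return [
        entry
        for entry in catalogue
        if entry.dataset == dataset and entry.first_day <= wanted <= entry.last_day
    ]
```

Looking up a data set and a date returns the matching labels and bins, so only those need to be recalled.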
Online (Spinning) or Offline (at Rest)
When building an archiving policy, use your Recovery Time Objective as a guide to deciding whether an archive should be online (readily accessible on an archiving server or an array such as Network Attached Storage) or offline (on tape or other removable media).
Holding archives online will incur a greater storage cost per GB, but they will be a lot easier to access when you need them the most. Large volumes of data may be impractical to keep online (re-emphasising the need to archive critical data only).
Data written offline will incur a much lower storage cost per GB, and is a much more practical way of archiving large data volumes. In my experience most backup requests are for data that is less than 72 hours old, so if tapes are rotated off site daily, recovery times can be lengthy.
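As a rough illustration of using the Recovery Time Objective as the guide, the sketch below compares it with the time an offline recovery would take. It assumes that time is simply the hours needed to recall the media plus the hours needed to restore from it; both parameters and the function name are hypothetical, and a real decision would also weigh cost and data volume.

```python
def choose_tier(rto_hours: float, recall_hours: float, restore_hours: float) -> str:
    """Pick 'offline' only when recalling and restoring fits within the RTO."""
    offline_recovery_hours = recall_hours + restore_hours
    return "offline" if offline_recovery_hours <= rto_hours else "online"
```

For example, with a 4 hour RTO, a 24 hour recall, and a 6 hour restore, the function returns "online", because offline media could not be brought back in time.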
Long Contracts
I regularly hear prospects asking for long-term archiving services (5 years or more), but then not being prepared to sign a contract of the same length. If you want a service provider to hold your data for many years, resign yourself to signing a contract that commits you both.
Encryption
Security frameworks often ask for data to be encrypted, including when written to archive media. If encryption is required, investigate the best way to encrypt the data (strength of encryption, known vulnerabilities etc), at what level to encrypt it (application level, media level etc), and how you plan to manage the keys. Getting hold of a 5 year old tape that you cannot decrypt because you lost the key will be a real pain, but writing data and the encryption key to the same media set may not be sensible.