Quantum White Paper Sample
Data deduplication systems differ in how they carry out the capacity optimization process, but all of them compare data segments to identify the ones that are repeated and can be replaced with a reference or pointer. One method, file-by-file comparison, examines two versions of the same file or fileset and identifies the data that is unique. A second method, employed in block-based systems, divides data into segments and uses a mechanism for remembering which blocks in the data set of interest are unique and which have already been written to disk. In each case, the result is that only unique data segments are stored, while repeated ones are referenced instead of being stored again. When several versions of similar data sets must be retained and accessed, either technology can provide substantial capacity savings. Both can also support highly optimized replication in which only the unique segments need to be transmitted, again assuming multiple similar data sets in which similar and unique elements can be identified.
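The block-based method described above can be illustrated with a minimal sketch. This is not any vendor's implementation: the class name DedupStore, the fixed 4-byte segment size, and the use of SHA-256 digests as the "mechanism for remembering" blocks are all assumptions made for illustration. Production systems typically use much larger segments, and many use variable-length segmentation.

```python
import hashlib


class DedupStore:
    """Hypothetical block-based store: unique segments kept once, repeats referenced."""

    def __init__(self, segment_size=4):
        # Assumption: fixed-size segments, kept tiny so the example is easy to follow.
        self.segment_size = segment_size
        self.segments = {}  # digest -> segment bytes (each unique segment stored once)
        self.recipes = {}   # data set name -> ordered list of digests (the references)

    def write(self, name, data):
        """Store a data set; return how many new unique segments were written."""
        recipe = []
        new_segments = 0
        for start in range(0, len(data), self.segment_size):
            segment = data[start:start + self.segment_size]
            digest = hashlib.sha256(segment).hexdigest()
            if digest not in self.segments:
                # First time this segment is seen: store the bytes.
                self.segments[digest] = segment
                new_segments += 1
            # Seen or not, the data set only records a reference to the segment.
            recipe.append(digest)
        self.recipes[name] = recipe
        return new_segments

    def read(self, name):
        """Rebuild a data set by following its references in order."""
        return b"".join(self.segments[d] for d in self.recipes[name])

    def segments_to_replicate(self, name, target):
        """Return the digests from `name` that the target store does not yet hold."""
        return [d for d in dict.fromkeys(self.recipes[name]) if d not in target.segments]


store = DedupStore()
first = store.write("monday", b"AAAABBBBCCCC")   # three unique segments
second = store.write("tuesday", b"AAAABBBBDDDD")  # only the last segment is new
```

In this sketch the second write stores one new segment instead of three, and a replication target that already holds the first data set would need only that one segment transmitted.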
The ability to locate and store only unique data segments is common to all deduplication approaches, and an inevitable by-product of the process is some level of system overhead: it takes more processor cycles and more time to find the redundant segments and compare them to what has already been recorded than it would take to simply write all the data, redundancy and all, directly to the disk medium.
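The source of that overhead can be made concrete by counting the extra work per segment. The sketch below is illustrative only: the function names plain_write and dedup_write, the in-memory list standing in for the disk, and the choice of SHA-256 with 4-byte segments are assumptions, and real systems differ in how they fingerprint and index data.

```python
import hashlib


def plain_write(data, disk):
    """Write everything as-is: no fingerprinting, no index lookups."""
    disk.append(data)
    return {"hashes": 0, "lookups": 0, "bytes_written": len(data)}


def dedup_write(data, index, disk, segment_size=4):
    """Write only unseen segments, counting the extra work deduplication adds."""
    stats = {"hashes": 0, "lookups": 0, "bytes_written": 0}
    for start in range(0, len(data), segment_size):
        segment = data[start:start + segment_size]
        digest = hashlib.sha256(segment).digest()  # extra processor cycles
        stats["hashes"] += 1
        stats["lookups"] += 1                      # extra comparison against the index
        if digest not in index:
            index.add(digest)
            disk.append(segment)
            stats["bytes_written"] += len(segment)
    return stats


data = b"AAAABBBBAAAABBBB"
plain_stats = plain_write(data, disk=[])
dedup_stats = dedup_write(data, index=set(), disk=[])
```

Here the plain write performs no hashing and stores all 16 bytes, while the deduplicating write performs four hashes and four index lookups to store 8 bytes. The trade is extra computation in exchange for reduced capacity.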