Efficient management and
storage of data is a problem that most organizations face these days. Various
methods and technologies are in place to address it. The available storage
space must be used efficiently so as to store the maximum amount of data in the
minimum space. Data deduplication is a method that looks for repetition or
redundancy in sequences of bytes across a large collection of data. The first
uniquely stored version of a data sequence is referenced at further points
rather than stored again. Data deduplication is also known as intelligent
compression or the single-instance storage method.
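The core idea can be sketched in a few lines of Python. This is a minimal
illustration, not a production design: it assumes a content-addressed store
keyed by SHA-256 digests, and the names store, references, and write are
hypothetical.

```python
import hashlib

# Minimal content-addressed store: each unique byte sequence is
# written once; later occurrences become references to that copy.
store = {}        # digest -> bytes (the first stored copy)
references = []   # the logical data stream, as a list of digests

def write(data: bytes) -> None:
    digest = hashlib.sha256(data).hexdigest()
    if digest not in store:       # first, unique occurrence
        store[digest] = data      # physically stored only once
    references.append(digest)     # repeats cost only a pointer

for chunk in (b"alpha", b"beta", b"alpha", b"alpha"):
    write(chunk)

print(len(references))  # 4 logical chunks written
print(len(store))       # 2 physical copies actually stored
```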
File Level Deduplication
In its most common form, deduplication is done at the file
level: no identical file is ever stored twice. Incoming data is filtered and
processed so as to avoid unnecessarily storing repeated copies of the same
file. This level of deduplication is known as the single-instance storage (SIS)
method. Another level of deduplication occurs at the block level, where blocks
of data that are identical across two non-identical files are detected and only
one copy of each block is stored. This method frees up more space than the
former, as it analyzes and compares data at a deeper level, as the sketch below
illustrates.
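To make the difference concrete, consider two files that differ in a single
block. A rough Python sketch follows; the 4-byte block size is unrealistically
small and chosen only for illustration, and the helper names are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4  # toy value; real systems use KB-sized blocks

def blocks(data: bytes):
    """Split data into fixed-size blocks."""
    for i in range(0, len(data), BLOCK_SIZE):
        yield data[i:i + BLOCK_SIZE]

file_a = b"AAAABBBBCCCCDDDD"
file_b = b"AAAABBBBXXXXDDDD"   # differs from file_a in one block only

# File-level (SIS): identical files dedupe, near-identical ones do not.
file_store = {hashlib.sha256(f).hexdigest(): f for f in (file_a, file_b)}
print(sum(len(f) for f in file_store.values()))   # 32 bytes stored

# Block-level: only the one differing block costs extra space.
block_store = {}
for f in (file_a, file_b):
    for b in blocks(f):
        block_store.setdefault(hashlib.sha256(b).hexdigest(), b)
print(sum(len(b) for b in block_store.values()))  # 20 bytes stored
```

Because the two files share three of their four blocks, block-level
deduplication stores 20 bytes where file-level deduplication stores 32.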
Target Level Deduplication
The second type of implementation is at the target level, that is, at the
backup system. Deployment here is easier than with source-side deduplication.
There are two modes of implementation: inline and post-process. In inline
implementation, deduplication is done before the data is written to the backup
disk. This requires less storage, which is an advantage, but more time, as the
backup can complete only after the deduplication filtering is done. In
post-process implementation, the storage requirement is higher, but the backup
itself completes much faster, since deduplication runs afterwards. The choice
between these methods depends on the system, the amount of data to be handled,
the storage space available for both primary data and backup, the processor
capacity, and the time constraints.
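The trade-off between the two modes can be sketched schematically in Python.
This is only an illustration of the ordering of steps; the function names,
the dict standing in for the backup disk, and the list standing in for the
staging area are all hypothetical.

```python
import hashlib

def digest(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()

# Inline: duplicates are filtered *before* anything reaches the
# backup disk. Less disk is needed, but the backup only finishes
# after every chunk has been hashed and checked.
def inline_backup(chunks, disk: dict) -> None:
    for chunk in chunks:
        disk.setdefault(digest(chunk), chunk)

# Post-process: the raw stream lands on staging storage first
# (short backup window), then deduplication runs over the staged
# data afterwards (extra disk needed in the meantime).
def post_process_backup(chunks, staging: list, disk: dict) -> None:
    staging.extend(chunks)        # backup completes quickly
    while staging:                # deduplication happens later
        chunk = staging.pop()
        disk.setdefault(digest(chunk), chunk)

disk: dict = {}
inline_backup([b"a", b"b", b"a"], disk)
print(len(disk))  # 2 unique chunks end up on disk either way
```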
The greatest advantage is the reduced storage requirement; deduplication also
improves bandwidth efficiency, since less data needs to be transferred. As
primary data storage has become inexpensive over the years, organizations tend
to retain the backup data of a project for a longer period so that new
employees can reuse it in future projects. These data stores need cooling and
proper maintenance and hence consume a lot of electric power. The amount of
disk or tape the organization needs to buy and maintain for data storage is
also reduced, lowering the total cost of storage. Deduplication can reduce the
bandwidth requirements for backup, and in some cases it can also speed up both
the backup and recovery processes.