Deduplication effect on throughput
In general, disk-based backup solutions with deduplication provide faster restore throughput than tape because disk is online and randomly accessible. Backup throughput, however, varies widely by vendor, because data deduplication is a resource-intensive process.
During writes, the deduplication process must determine whether a small data sequence has been stored before, potentially anywhere in petabytes of prior data. A simple index of that data is too big to fit in random access memory (RAM) except in very small deployments, so many solutions must seek on disk, and disk seeks are notoriously slow and not getting faster.
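At its core, the lookup described above is a fingerprint index: each chunk of incoming data is hashed, and the hash is checked against an index of everything stored so far. The sketch below is a minimal, hypothetical illustration of that idea (fixed-size chunks, an in-memory dict as the index); real products use variable-size chunking and indexes far too large for memory.

```python
import hashlib

CHUNK_SIZE = 8 * 1024  # 8 KiB chunks; granularity is an illustrative assumption


def chunk_fingerprints(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split a byte stream into fixed-size chunks and fingerprint each one."""
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        yield hashlib.sha256(chunk).hexdigest(), chunk


def deduplicate(stream: bytes, index: dict) -> int:
    """Store only chunks whose fingerprint is new; return bytes actually written."""
    written = 0
    for fp, chunk in chunk_fingerprints(stream):
        if fp not in index:  # in a real system this lookup is what may hit disk
            index[fp] = chunk
            written += len(chunk)
    return written


index = {}
data = b"".join(bytes([i]) * CHUNK_SIZE for i in range(4))  # four distinct chunks
first = deduplicate(data, index)   # all chunks new: every byte is written
second = deduplicate(data, index)  # identical data again: nothing is written
```

The second pass writes zero bytes because every fingerprint is already in the index; the cost of that benefit is the lookup itself, which is exactly where the RAM-versus-disk problem arises.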
The easiest ways to make data deduplication fast are to accept worse data reduction by looking only for large sequences, so disk seeks happen less often, and to add more hardware so there are more disks across which to spread the load. Both have the unfortunate side effect of raising the system price, making it less attractive against tape from a cost perspective.
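Some rough arithmetic shows why larger sequences shrink the index at the cost of data reduction. The figures below (1 PiB of stored data, 32 bytes of index entry per fingerprint) are illustrative assumptions, not any vendor's actual numbers.

```python
# Rough index sizing: entries = stored bytes / chunk size, each entry ~32 bytes.
ENTRY_BYTES = 32  # assumed per-fingerprint index overhead


def index_size_gib(stored_bytes: float, chunk_bytes: int) -> float:
    """Approximate fingerprint-index size in GiB for a given chunk granularity."""
    return stored_bytes / chunk_bytes * ENTRY_BYTES / 2**30


PIB = 2**50  # 1 PiB of stored backup data
small_chunks = index_size_gib(PIB, 8 * 1024)    # fine-grained: better reduction
large_chunks = index_size_gib(PIB, 128 * 1024)  # coarse: smaller index, worse reduction
```

Under these assumptions, 8 KiB chunks need a 4 TiB index (far beyond RAM), while 128 KiB chunks need 256 GiB; the coarser index is easier to serve but finds far fewer duplicates.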
Vendors vary in their approaches, but we took a unique approach with EMC Data Domain systems, which leverage a central processing unit (CPU)-centric architecture to quickly and efficiently identify redundant data, enabling industry-leading throughput.
CPU vs. disk-centric (spindle-bound) throughput
Unlike EMC, many vendors take a disk-centric approach to deduplication. But because disk drives are the slowest component in any storage system, reaching higher performance typically means striping data across a large number of drives so they handle I/O in parallel.
If a system uses this method to reach its performance requirements, consider the balance between performance and capacity carefully: the whole point of data deduplication is to reduce the number of disk drives.
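The spindle-bound tradeoff can be made concrete with a quick calculation. The per-drive figure below (~150 MB/s sustained sequential throughput for a nearline HDD) is an illustrative assumption.

```python
import math

PER_DISK_MBPS = 150  # assumed sustained sequential throughput per drive


def spindles_needed(target_mbps: float, per_disk: float = PER_DISK_MBPS) -> int:
    """Minimum drive count to reach a target throughput purely by striping."""
    return math.ceil(target_mbps / per_disk)


drives = spindles_needed(3000)  # a 3 GB/s target needs 20 drives for bandwidth alone
```

A system that needs 20 spindles just for bandwidth may end up with far more raw capacity than the deduplicated workload requires, which is the imbalance described above.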
With EMC Data Domain Stream Informed Segment Layout (SISL)—an inline, CPU-centric approach—very few disk drives are needed to reach maximum performance, so deduplication delivers on the expectation of a smaller storage footprint.
Single-stream backup and restore throughput
Single-stream performance indicates how fast a given file or database can be written, read, or copied to tape for long-term retention.
Because of backup windows for critical data, backup throughput is what most people ask about, though restore time matters more for most service level agreements (SLAs).
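Translating single-stream throughput into window or SLA terms is simple arithmetic. The sketch below uses purely illustrative figures (a 4 TB database, 400 MB/s sustained single-stream rate).

```python
def hours_to_move(dataset_gb: float, stream_mbps: float) -> float:
    """Hours to back up or restore one stream at a sustained MB/s rate."""
    return dataset_gb * 1024 / stream_mbps / 3600


backup_hours = hours_to_move(4096, 400)  # 4 TB at 400 MB/s: just under 3 hours
```

The same formula applies to restore: if the restore rate is half the backup rate, the restore time doubles, which is why restore throughput deserves at least as much scrutiny as backup throughput when validating an SLA.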
Aggregate backup/restore throughput per system
With multiple streams, how fast can a given system ingest or recover data? This helps gauge the number of controllers or systems needed for a deployment.
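Sizing from aggregate throughput works the same way: divide the required ingest rate by what one system sustains and round up. The figures below (500 TB protected in an 8-hour window, 4 GB/s per system) are illustrative assumptions only.

```python
import math


def systems_needed(total_tb: float, window_hours: float,
                   per_system_mbps: float) -> int:
    """Systems required so aggregate ingest fits inside the backup window."""
    required_mbps = total_tb * 1024 * 1024 / (window_hours * 3600)
    return math.ceil(required_mbps / per_system_mbps)


count = systems_needed(500, 8, 4000)  # 500 TB in 8 hours at 4 GB/s per system
```

Here the workload demands roughly 18.2 GB/s of aggregate ingest, so five 4 GB/s systems are needed; the same arithmetic with restore rates checks whether recovery SLAs can be met by the same fleet.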