Making Storage Sense: What Else To Be Aware of with Post-processing Deduplication

W. Curtis Preston recently posted several entries in his Backup Central blog about post-processing dedupe speeds. The first entry asked why post-processing vendors, specifically FalconStor, Sepaton, Exagrid, and Quantum, do not publish their dedupe speeds, and the second entry included agreement from each of those vendors to publish them. This, of course, is all good news and extremely helpful for end-users who are saturated with marketing hype about this or that dedupe solution.

As someone who has designed and sold VTL solutions from FalconStor, I want to call attention to a couple of points that are often overlooked by end-users when they evaluate a post-processing dedupe solution:

1) The dedupe performance of any node or set of nodes is a function of the amount of RAM in each node. The reason is that the seed data to be deduped and the hash table used for the dedupe process are both held in memory; when there is not enough available RAM, the solution will at best bottleneck and at worst stop the dedupe process. This is in contrast to the ingest process, which is not RAM intensive but CPU and HBA throughput intensive.

For this reason, it is extremely important to size your nodes properly for dedupe, separately from the sizing for data ingest; essentially, the more RAM in the nodes, the better. However, one must also decide whether the solution should be scaled up or scaled out. In other words, given financial and data center constraints, is it better to scale up to a few nodes with a large amount of RAM or scale out to a greater number of nodes with less RAM each? That recommendation is best made by the individual vendor for their solution, but the point is that the end-user needs to understand the reasons behind specific designs and choose the one that fits best in their environment.
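To make the scale-up versus scale-out question concrete, here is a rough sketch of how one might estimate the RAM needed for a dedupe hash table and the node count that follows from it. Everything in it is an assumption for illustration: the function names, the chunk size, the bytes per hash entry, and the idea that the table is split evenly across nodes are all hypothetical and do not describe any particular vendor's design. Ask your vendor for the real figures.

```python
import math

def hash_index_ram_gib(unique_data_tib, avg_chunk_kib, bytes_per_entry):
    """Estimate RAM (GiB) for an in-memory hash table with one entry per unique chunk.

    All three inputs are hypothetical sizing parameters, not vendor figures.
    """
    # 1 TiB = 1024**3 KiB, so this is the number of unique chunks to index.
    chunks = unique_data_tib * 1024**3 / avg_chunk_kib
    # Total bytes for the table, converted to GiB.
    return chunks * bytes_per_entry / 1024**3

def nodes_needed(total_ram_gib, usable_ram_per_node_gib):
    """Nodes required if the table is spread evenly across nodes (an assumption)."""
    return math.ceil(total_ram_gib / usable_ram_per_node_gib)

# Illustrative inputs only: 50 TiB of unique data, 8 KiB chunks, 32-byte entries.
total = hash_index_ram_gib(50, 8, 32)
scale_up = nodes_needed(total, 128)   # fewer nodes, more RAM each
scale_out = nodes_needed(total, 64)   # more nodes, less RAM each
print(total, scale_up, scale_out)
```

The point of the sketch is only that the same RAM requirement can be met with different node counts; which layout is cheaper or easier to operate depends on the vendor's architecture and your data center constraints.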

2) One potential bottleneck that is often overlooked is the storage controllers that provide the back-end capacity for the dedupe solution. Using the numbers that Curtis cites in his blog, let's assume a data ingest rate of 1 GB (1,000 MB)/sec and a dedupe rate of 500 MB/sec. If you are doing concurrent processing (which means starting the dedupe process when a virtual tape is written and "ejected" or a backup job is complete, instead of traditional post-processing where dedupe begins when all backup jobs are finished writing to disk), then you are looking at sizing for 1.5 GB/sec (1,500 MB/sec) of throughput. If you then assume that a typical storage controller with quad-core CPUs can perform at ~400 MB/sec with the typical I/O profile for a VTL/dedupe workload, you are looking at having four storage controllers in your solution (1,500 / 400 = 3.75, rounded up). Depending on the storage system you are using for the back-end, that could be a single system with 4 controllers or 2 systems, each with 2 controllers; in either case, there are CapEx and OpEx costs to be considered. The number of storage controllers, of course, can be scaled back if you are doing post-processing dedupe, where the dedupe process does not begin until all data ingest is completed.
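The controller arithmetic above can be written out as a small sketch. It uses the figures from this post (1,000 MB/sec ingest, 500 MB/sec dedupe, ~400 MB/sec per controller); the function names are hypothetical, and treating the post-processing case as "size for the larger of the two rates" is my simplifying assumption, since ingest and dedupe do not overlap in that mode.

```python
import math

def required_throughput_mb_s(ingest_mb_s, dedupe_mb_s, concurrent):
    """Back-end throughput to size for, in MB/sec.

    Concurrent processing: ingest and dedupe overlap, so the rates add.
    Post-processing: they run one after the other, so size for the larger
    of the two (a simplifying assumption).
    """
    if concurrent:
        return ingest_mb_s + dedupe_mb_s
    return max(ingest_mb_s, dedupe_mb_s)

def controllers_needed(throughput_mb_s, per_controller_mb_s):
    """Round up, since you cannot buy a fraction of a controller."""
    return math.ceil(throughput_mb_s / per_controller_mb_s)

INGEST, DEDUPE, PER_CONTROLLER = 1000, 500, 400  # MB/sec, from the text

concurrent = controllers_needed(
    required_throughput_mb_s(INGEST, DEDUPE, concurrent=True), PER_CONTROLLER)
post_process = controllers_needed(
    required_throughput_mb_s(INGEST, DEDUPE, concurrent=False), PER_CONTROLLER)
print(concurrent, post_process)
```

Under these assumptions, concurrent processing calls for four controllers and post-processing for three, which is the "scaling back" described above. Substitute your own measured per-controller throughput before relying on the result.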

There is also the issue of what will be prioritized if there is a throughput bottleneck during concurrent processing. Will the dedupe process be throttled down in favor of data ingest, or will dedupe be prioritized? Personally, I would consider data ingest to be the higher priority, since what is most important is finishing the backups. End-users should ask each vendor they are considering to explain how bottlenecks are handled. I should add that, for this reason, several FalconStor personnel recommended to me that post-processing be used whenever possible.
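As an illustration of the ingest-first policy I would prefer, here is a minimal sketch of how available back-end throughput could be divided when it falls short of the combined demand. The function name and the policy itself are hypothetical; this is not how any specific product is known to behave, which is exactly why the question should go to the vendor.

```python
def allocate_ingest_first(capacity_mb_s, ingest_demand_mb_s, dedupe_demand_mb_s):
    """Split back-end throughput (MB/sec), serving ingest before dedupe.

    Returns (ingest_rate, dedupe_rate). Dedupe only gets what is left over
    after ingest has been served, so it is the process that gets throttled.
    """
    ingest_rate = min(ingest_demand_mb_s, capacity_mb_s)
    dedupe_rate = min(dedupe_demand_mb_s, capacity_mb_s - ingest_rate)
    return ingest_rate, dedupe_rate

# Example: three controllers at ~400 MB/sec each gives 1,200 MB/sec of
# capacity against 1,000 MB/sec of ingest and 500 MB/sec of dedupe demand.
print(allocate_ingest_first(1200, 1000, 500))  # (1000, 200)
```

In this example the backups run at full speed while dedupe is throttled to 200 MB/sec, so the dedupe window stretches but the backup window does not.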

What I've detailed above is based on my own experience, specifically with FalconStor. I would invite any of the post-processing dedupe vendors to elaborate or correct what I have said. The main thing is to get the correct information to the end-user community so that they can make the right choice for their environment.

Friday, June 4, 2010
