February 15, 2011 1:46 PM
Just because you can doesn't mean you should
I have been doing some work on a WAN design for a large private sector communications organisation. As usual, the difficult bit has been pinning them down on what they really need (not want, need) to support their organisation.
The list of requirements to begin with read more like a techy wish-list than anything vaguely business-focussed, but we managed to get them to think about things from an application and user perspective, rather than just listing all the cool tweaks they wanted on the routers.
Then we started on SLAs, in particular network convergence and acceptable outage times. "The network must reconverge in 50msec," we were told. On asking why, we were told that the 'old' WAN did, so this one had to as well. Of course, the old WAN is SDH-based, and one thing that SDH is very good at is recovering from failures quickly. That 50msec figure, which originated in the days of circuit-switching, just keeps coming back to haunt us.
Now it is quite possible to make an IP-based network recover, reroute and reconverge in 50msec. But does it really need to? The more tightly you tune something, the more complex it is. Aggressive BFD timers, very fast SPF runs, prefix prioritisation (designing for feasible successors in an EIGRP network—yes, they are still around, and yes they can recover extremely quickly if you’ve built them right) are all tools that you can use to make your network react in milliseconds. But a lot of networks don’t need that level of speed—or complexity.
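To give a flavour of what that tuning involves, here's an illustrative IOS-style fragment (the syntax is Cisco IOS; the timer values are examples of the aggressive end of the scale, not recommendations from any particular design):

```
interface GigabitEthernet0/0
 ! BFD hellos every 50 ms; declare the neighbour down after
 ! three missed packets, i.e. failure detection in ~150 ms
 bfd interval 50 min_rx 50 multiplier 3
!
router ospf 1
 ! Let BFD tear down OSPF adjacencies as soon as it detects failure
 bfd all-interfaces
 ! Aggressive SPF throttle: first run after 10 ms, hold 100 ms,
 ! backing off to a 5 s maximum under sustained churn
 timers throttle spf 10 100 5000
```

Every one of those knobs is another thing to document, monitor and debug when the network misbehaves, which is exactly the complexity trade-off in question.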
Okay, so taking 30 seconds (or three minutes) is maybe unacceptable, but there’s a middle ground where you have a network that provides the level of fast convergence you need that isn’t overengineered. Because overengineering is expensive, either in terms of the Capex you need to spend to buy extra hardware, or in terms of the Opex you’ll run up trying to configure, manage and troubleshoot it.
Your VoIP network does not need 50msec convergence times. Between 500msec and a second, and your users will never notice; you can get away with longer than that. I've seen outages of 50msec and 500msec on high definition video streams, and there was no appreciable difference between the two. The impact on the user depends largely on which type of frames you're unlucky enough to lose, rather than on how fast the network reconverges.
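To put rough numbers on that, here's a back-of-the-envelope sketch. It assumes 20 ms voice packetisation (a common VoIP default, not a figure from any particular deployment):

```python
# Rough arithmetic: how many voice packets does a reconvergence
# outage actually cost? Assumes one RTP packet every 20 ms, which
# is a common packetisation interval for VoIP codecs such as G.711.

PACKET_INTERVAL_MS = 20

def packets_lost(outage_ms: int) -> int:
    """Approximate number of voice packets dropped during an outage."""
    return outage_ms // PACKET_INTERVAL_MS

for outage in (50, 500, 1000):
    print(f"{outage} ms outage -> ~{packets_lost(outage)} packets lost")
```

Even the one-second case is a brief blip that packet loss concealment in the endpoints will largely paper over, which is why sub-second convergence is plenty for voice.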
Yes, there are some very precise signalling applications that really do need 50msec recovery times. But for most of us, subsecond is much more than good enough, and a couple of seconds quite adequate for many applications.
So let’s build networks that are fit for purpose, rather than gold-plated. If nobody is going to notice the difference between a network that recovers in 50msec and one that recovers in say half a second, you’ll get no thanks for building the former.
Mark you, I've never seen the point of 'normal' road cars that have a top speed of 140 miles an hour, either.