Highly Available Services for the Enterprise
In this article I am going to explore methods for creating highly available enterprise services by using the inherent properties of the network.
Background
The first thing to recognize is that redundancy and availability are different. As a network engineer I can make a service so redundant that there is no single point of failure. But if clients cannot get to the service they desire, all my redundancy is wasted. I cannot begin to count the number of hours I've spent on calls and email chains with telcos troubleshooting circuit issues. Gone are the days when a T1 or DS3 would "go down" and troubleshooting would be straightforward. With Ethernet circuit hand-offs everywhere, traffic can go through 5 different devices before it is "connected" to the remote side. I've had all the circuits at a site, circuits that were supposed to be redundant, run through a common piece of telco gear, and when that gear got borked, so did all my circuits to that one site. I'm not saying redundancy is bad. Quite the contrary, you should plan redundancy in your circuits, etc. But redundancy alone is not sufficient to maintain service availability.
"Welcome to the Department of Redundancy Department"
Second, you can have too much redundancy. I've seen designs where every up-link from an access switch was contained in a port-channel, going to different physical ports on a VSS distribution stack. At the same time the distribution stack was running a First Hop Redundancy Protocol (with no peers). The up-links of the distribution switches connected to two WAN routers using two L3 up-links per router (also split across the distribution stack), utilizing multi-path routing to distribute traffic. The WAN routers were also set up to be "active/active" for inbound and outbound communications. So what is the issue? Now when something is "slow", troubleshoot where the problem is! Trace that traffic flow. I dare you! (If you don't understand the above paragraph, the synopsis is "there is a metric ton of redundancy".)
Third, when you have eliminated all hardware single points of failure, you will begin to suffer from software failures. And software failures in the network will kill you. Well, not actually, but you enter the realm of second-guessing the design, your understanding of the technology, your life choices, your vendor choices, etc. "Is the ARP making it across the uplink?" "Is it being received?" "Is it being processed?" "How about the OSPF or EIGRP hello packets?" "How can I have two DRs active at the same time?"
At this point half the readers are singing kumbaya and the other half are wondering if I've had a stroke at the keyboard because it appears that I was typing words but they make no sense. So let me summarize like this. Things will break in unusual and unexpected ways! Murphy is very much alive in networking.
One Solution
So what can we do? First, acknowledge stuff will break. Second, acknowledge that when it breaks, it is unlikely to be broken in two places at once. And third, network protocols are literally designed to move traffic around a failed environment. So how does that help? Let's define a simple service... say Internet access. And let's assume that I can get Internet access from providers on the west coast and providers on the east coast. Great! Fixed! That was easy.... what's that you say? "I need a firewall, a WAF, a proxy, etc., so all that traffic must come back to a central location!" Does it? What if the security stack were duplicated, with one copy on the west coast and one on the east coast? Now what if I routed my west coast sites through the west coast access and the east coast sites through the east coast access? Then during a failure on one coast, traffic would route across the backbone to the other coast. Sound good?
What did it cost and was it worth it? Well, we duplicated the security stacks.
What did we gain? We split the traffic across two coasts (in a perfect world, cutting the load at each hub in half), sped up access to the Internet, and reduced latency. In addition, you may not need as much equipment, or equipment as large, because each hub is only servicing half of your enterprise.
So how does traffic get directed from one hub to another? The first method is default routing. All traffic that is not destined for the company network should follow the default route, which means multiple default routes will exist within the LAN/WAN. The second method is to utilize Anycast routing (RFC 1546). This is a method where a single IP address is advertised from multiple locations, and the "closest" location to the client is where the traffic ends up. Not sure that this works? This is exactly how Google's 8.8.8.8 DNS service works. In both cases, when there is a failure at one of the hubs, routing will move the traffic to the other hub.
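To make that concrete, here is a minimal sketch of what the hub-side advertisement might look like on a Cisco-style router. The AS number (65000), the neighbor, and the anycast service address (192.0.2.53, a documentation prefix standing in for something like a DNS VIP) are placeholders I've made up for illustration, not values from any real design.

    ! Hypothetical hub router: originate a default route and an anycast
    ! service prefix toward the WAN. Both hubs advertise the same prefixes;
    ! normal routing metrics steer each site to its closest hub.
    router bgp 65000
     ! Requires a default route to already exist in the routing table.
     network 0.0.0.0 mask 0.0.0.0
     ! Anycast service address (must also exist in the routing table).
     network 192.0.2.53 mask 255.255.255.255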
So what happens during a failure of the security stack? Or of the Internet provider? The same thing. Each hub must maintain a state of "goodness". This is the key to success. Each hub must have a way of determining its "goodness". Having a process, or several, that checks not only the individual components but the entire path through the hub (which includes the full security stack) is critical. There are several methods to achieve this. Examples include IP SLA probes, load balancer probes, EEM scripts, or really any other method. The point of these tests is that when the hub reaches a "failed" state, ALL services from the hub fail. What does "failed" actually mean? That really depends on the hub's mission. But when that "failed" threshold is reached, the hub "goes dark". The best method I've found to accomplish this is the use of a "canary route". This is a route that is not generally used within the WAN environment, but whose presence (or absence) indicates that the hub is unhealthy and that the hub routers should withdraw the default and any Anycast routes.
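As a rough illustration, here is one way the health check and canary route could be wired together on a Cisco-style hub router using IP SLA, object tracking, and BGP conditional advertisement. The probe target (203.0.113.10), canary prefix (198.51.100.1/32), interface, AS number, and neighbor address are all made-up placeholders, it assumes the network statements from the previous sketch are also present, and a real hub would use more than a single ICMP probe; treat this as a sketch of the idea, not a drop-in config.

    ! Hypothetical: probe a target that sits beyond the full security stack,
    ! so a successful reply implies the entire path through the hub works.
    ip sla 10
     icmp-echo 203.0.113.10 source-interface GigabitEthernet0/1
     frequency 10
    ip sla schedule 10 life forever start-time now
    !
    ! Track object that follows the probe result.
    track 10 ip sla 10 reachability
    !
    ! The "canary" route is installed only while the probe succeeds.
    ip route 198.51.100.1 255.255.255.255 Null0 track 10
    !
    ip prefix-list CANARY permit 198.51.100.1/32
    ip prefix-list HUB-SERVICES permit 0.0.0.0/0
    ip prefix-list HUB-SERVICES permit 192.0.2.53/32
    !
    route-map CANARY-PRESENT permit 10
     match ip address prefix-list CANARY
    route-map ADV-HUB permit 10
     match ip address prefix-list HUB-SERVICES
    !
    router bgp 65000
     ! Put the canary into the BGP table so the exist-map can see it.
     network 198.51.100.1 mask 255.255.255.255
     ! Advertise the default and anycast prefixes only while the canary
     ! exists; when the probe fails, the canary disappears, the prefixes
     ! are withdrawn, and the hub "goes dark".
     neighbor 10.0.0.2 advertise-map ADV-HUB exist-map CANARY-PRESENT

This particular sketch withdraws the routes when the canary disappears; you could just as easily invert the logic and inject the canary only when the hub is unhealthy, as long as every hub router keys off the same signal.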
Still interested? What other services can be placed in the hub besides Internet access? DNS works great. NTP as well. Basically, any service that keeps minimal or ephemeral state is perfect for being placed in the hub. There are some services that are not appropriate. A large database? Probably not, but the front-end and caching servers for that database might be good candidates.
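For something like anycast DNS, the per-hub glue can be as simple as a route for the shared service address pointing at the local resolver or load-balancer VIP, tied to the same health track so it disappears when the hub goes dark. Again, 192.0.2.53 (the anycast DNS address) and 10.1.1.53 (an assumed local VIP) are placeholders, and track 10 refers back to the sketch above.

    ! Hypothetical: the shared anycast DNS address is reachable through this
    ! hub's local resolver VIP only while the hub health track is up.
    ip route 192.0.2.53 255.255.255.255 10.1.1.53 track 10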
Conclusion
As network engineers we've been so conditioned that something going "down" is bad and requires all hands on deck that the first time a hub goes down and your phone doesn't start ringing off the hook, you'll think something is wrong. (VoIP issue? ;p ) And the next time it goes down and a lynch mob isn't beating down your door asking why it isn't fixed yet, you'll still feel funny, but there won't be the usual yelling and finger pointing.
I've been operating an environment very similar to the one I described since 2014. I have more than six hubs worldwide for Internet, WAF, proxy and DNS services. I'll tell you, it works. I've taken hubs down in the middle of the day for maintenance. No one noticed. I even have the security group who runs the security stack do updates to the security equipment. In fact, instead of fighting the "now we have more devices to take care of" battle, they are actually some of the biggest advocates of this design.
Hopefully I've at least given you something new to think about.