Highly Available Services for the Enterprise

In this article I am going to explore methods for creating highly available enterprise services by using properties inherent in the network itself.

Background

The first thing to recognize is that redundancy and availability are different. As a network engineer I can make a service so redundant that there is no single point of failure. But if clients cannot get to the service they want, all my redundancy is wasted. I cannot begin to count the number of hours I've spent on calls and email chains with telcos troubleshooting circuit issues. Gone are the days when a T1 or DS3 would "go down" and troubleshooting would be straightforward. With Ethernet circuit hand-offs everywhere, traffic can go through 5 different devices before it is "connected" to the remote side. I've had all the circuits at a site, which were supposed to be redundant, go through a common piece of telco gear, and when that gear got borked, so did all my circuits to that one site. I'm not saying redundancy is bad. Quite the contrary, you should plan redundancy in your circuits, etc. But redundancy alone is not sufficient to maintain service availability.

"Welcome to the Department of Redundancy Department"

Second, you can have too much redundancy. I've seen designs that have every up-link from an access switch bundled in a port-channel, going to different physical ports on a VSS distribution stack. At the same time the distribution stack was running a First Hop Redundancy Protocol (with no peers). The up-links of the distribution switches connect to two WAN routers using two L3 up-links per router (also split across the distribution stack), utilizing multi-path routing to distribute traffic. The WAN routers were also set up to be "active/active" for inbound and outbound communications. So what is the issue? Now when something is "slow", try to troubleshoot where the problem is! Trace that traffic flow. I dare you! (If you don't understand the above paragraph, the synopsis is "there is a metric ton of redundancy.")

Third, when you have eliminated all hardware single points of failure, you will begin to suffer from software failures. And software failures in the network will kill you. Well, not actually, but you enter the realm of second-guessing the design, your understanding of technology, your life choices, your vendor choices, etc. "Is the ARP making it across the uplink?" "Is it being received?" "Is it being processed?" "How about the OSPF or EIGRP hello packets?" "How can I have two DRs active at the same time?"

At this point half the readers are singing kumbaya and the other half are wondering if I've had a stroke at the keyboard because it appears that I was typing words but they make no sense.  So let me summarize like this.  Things will break in unusual and unexpected ways!  Murphy is very much alive in networking.

One Solution

So what can we do? First, acknowledge that stuff will break. Second, acknowledge that when it is broken, it is unlikely to be broken in two places at once. And third, network protocols are literally designed to move traffic around a failure. So how does that help? Let's define a simple service... say Internet access. And let's assume that I can get Internet access from providers on the west coast and providers on the east coast. Great! Fixed! That was easy... What's that you say? "I need a firewall, a WAF, a proxy, etc., so all that traffic must come back to a central location!" Does it? What if the security stack was duplicated, with one copy placed on the west coast and one on the east coast? Now what if I routed my west coast sites through the west coast access and the east coast sites through the east coast access? Then during a failure on one coast, traffic would route across the backbone to the other coast. Sound good?

What did it cost and was it worth it? Well, we duplicated the security stacks.

What did we gain? We split the traffic across 2 coasts (assume we reduced traffic at each hub by half in a perfect world). We sped up access to the Internet and reduced latency. In addition, you may not need as much equipment, or equipment as large, because each hub is only servicing half of your enterprise.

So how does traffic get directed from one hub to another? The first method is default routing. All traffic that is not destined for somewhere within the company should follow the default route. This means that multiple default routes will exist within the LAN/WAN. The second method is to utilize Anycast routing (RFC 1546). This is a method where a single IP address is advertised from multiple locations. The "closest" location to the client is where the traffic ends up. Not sure that this works? This is exactly how Google's 8.8.8.8 DNS service works. In both cases, when there is a failure at one of the hubs, routing will move the traffic to the other hub.
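The "closest hub wins, and traffic moves when a hub withdraws" behavior can be sketched in a few lines of Python. This is only a toy model, not a routing protocol: the site names, hub names and path costs are made up for illustration, and `pick_hub` and `traffic_map` are hypothetical helpers.

```python
# Toy model of default/Anycast route selection. All names and costs are
# hypothetical; real routers use their routing protocol's own metrics.

# Path cost from each site to each hub (lower is "closer").
PATH_COST = {
    "seattle": {"west-hub": 10, "east-hub": 70},
    "denver": {"west-hub": 30, "east-hub": 45},
    "boston": {"west-hub": 75, "east-hub": 10},
}


def pick_hub(site, advertising_hubs):
    """Return the lowest-cost hub that still advertises the route, or None."""
    candidates = {
        hub: cost
        for hub, cost in PATH_COST[site].items()
        if hub in advertising_hubs
    }
    if not candidates:
        return None  # no hub advertises the route: the service is unreachable
    return min(candidates, key=candidates.get)


def traffic_map(advertising_hubs):
    """Map every site to the hub its traffic would reach."""
    return {site: pick_hub(site, advertising_hubs) for site in PATH_COST}


if __name__ == "__main__":
    print(traffic_map({"west-hub", "east-hub"}))  # both hubs advertising
    print(traffic_map({"east-hub"}))  # west hub has withdrawn its routes
```

With both hubs advertising, each site lands on its nearer hub. Once the west hub stops advertising, every site ends up at the east hub without any per-site change.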

So what happens during a failure of the security stack? Of the Internet provider? The same thing. Each hub must maintain a state of "goodness". This is the key to success. Each hub must have a way of determining its "goodness". Having a process, or several, that checks not only the individual components but the entire path through the hub (which includes the full security stack) is critical. There are several methods to achieve this. Examples include using IPSLA probes, Load Balancer probes, EEM scripts or really any other method. The point of these tests is that when the hub reaches a "failed" state, ALL services from the hub fail. What does "failed" actually mean? That really depends on what the hub's mission is. But when that "failed" threshold is reached, the hub "goes dark". The best method I've found to accomplish this is the use of a "canary route". This is a route that is not generally used within the WAN environment. Its presence indicates that the hub is healthy; its absence indicates that the hub is unhealthy and that the hub routers should withdraw the default and any Anycast routes.
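One way to picture the "goodness" decision is as a small state machine fed by probe results. The sketch below is an assumption-heavy illustration in Python: the probe names, the `HubHealth` class and the "two bad rounds in a row" threshold are all hypothetical, and what counts as "failed" depends on the hub's mission.

```python
# Sketch of a hub "goodness" decision. Probe names and the
# two-consecutive-failures rule are assumptions, not from a real deployment.
from dataclasses import dataclass, field


@dataclass
class HubHealth:
    """Tracks probe results and decides whether the canary route should exist."""

    required_probes: tuple
    fail_after: int = 2  # consecutive bad rounds before the hub "goes dark"
    bad_rounds: int = field(default=0, init=False)

    def record_round(self, results):
        """results maps probe name -> True (passed) or False (failed)."""
        # A round is good only if every required probe passed; a missing
        # result counts as a failure.
        round_ok = all(results.get(name, False) for name in self.required_probes)
        self.bad_rounds = 0 if round_ok else self.bad_rounds + 1
        return self.canary_present

    @property
    def canary_present(self):
        """The canary route exists only while the hub is considered healthy."""
        return self.bad_rounds < self.fail_after


if __name__ == "__main__":
    hub = HubHealth(required_probes=("internet", "dns", "through-firewall"))
    print(hub.record_round({"internet": True, "dns": True, "through-firewall": True}))
    print(hub.record_round({"internet": True, "dns": True, "through-firewall": False}))
    print(hub.record_round({"internet": True, "dns": False, "through-firewall": False}))
```

Requiring every probe to pass reflects the idea that the whole path through the hub has to work, not just individual boxes. In this sketch a single good round brings the hub back immediately; in practice you may prefer to wait for several good rounds so that a flapping hub stays dark.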

! Create IPSLA probe for external source
! Multiple probes would be better
ip sla 10
 icmp-echo 198.51.100.1
 threshold 150
 timeout 150
 frequency 60
 history enhanced interval 3600 buckets 100
ip sla schedule 10 life forever start-time now

! Track the probe(s) for reachability
track 10 ip sla 10 reachability

! Create a static route (does not need to be distributed) that 
! exists based upon the state of the track
ip route 192.0.2.1 255.255.255.255 Null0 track 10

! Create some prefix-lists to match routes
ip prefix-list CANARY_ROUTE seq 10 permit 192.0.2.1/32
ip prefix-list DEFAULT_ROUTE seq 10 permit 0.0.0.0/0
ip prefix-list ANYCAST_ROUTE seq 10 permit 203.0.113.10/32

! Create our route-maps for testing and advertising
route-map CANARY_ROUTE permit 10
 match ip prefix-list CANARY_ROUTE
route-map CANARY_ROUTE deny 20

route-map CONDITIONAL_ADVERTISE permit 10
 match ip prefix-list DEFAULT_ROUTE
route-map CONDITIONAL_ADVERTISE permit 20
 match ip prefix-list ANYCAST_ROUTE
route-map CONDITIONAL_ADVERTISE deny 90

! Assign the route-maps to our BGP peers
router bgp AAAA
 neighbor 1111
  description WAN PEER
  address-family ipv4 unicast
   default-originate route-map CANARY_ROUTE
   advertise-map CONDITIONAL_ADVERTISE exist-map CANARY_ROUTE
Example Canary IOS Config
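
If the chain in the config is hard to follow, it may help to see the intent written as plain logic. The Python below is a toy model of what the config is meant to do, not of how IOS actually implements it. The prefixes are copied from the example; the function names are hypothetical.

```python
# Toy model of the canary logic in the example config; it mirrors the intent,
# not actual IOS behavior. Prefixes match the example config.

CANARY = "192.0.2.1/32"
CONDITIONAL = ("0.0.0.0/0", "203.0.113.10/32")  # default and Anycast routes


def local_routes(probe_reachable):
    """The tracked static route is installed only while the probe succeeds."""
    return {CANARY} if probe_reachable else set()


def advertised_to_wan(probe_reachable):
    """Advertise the default and Anycast routes only if the canary exists."""
    if CANARY in local_routes(probe_reachable):
        return set(CONDITIONAL)
    return set()  # hub "goes dark": routes are withdrawn


if __name__ == "__main__":
    print(sorted(advertised_to_wan(True)))
    print(sorted(advertised_to_wan(False)))
```

The probe never touches the advertised routes directly. It only controls the canary, and everything the hub offers hangs off that one route.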

Still interested? What other services can be placed in the hub besides Internet access? DNS works great. NTP as well. Basically any service that keeps minimal or only temporary state is perfect for being placed in the hub. There are some services that are not appropriate. A large database is probably not a good fit, but the front-end and caching servers for that database might be good candidates.

Conclusion

As network engineers, we've been so conditioned to believe that something going "down" is bad and requires all hands on deck that the first time a hub goes down and your phone doesn't start ringing off the hook, you'll think something is wrong. (VoIP issue? ;p ) The next time it goes down and a lynch mob isn't beating down your door asking why it isn't fixed yet, you'll still feel funny, but there won't be the usual yelling and finger pointing.

I've been operating an environment very similar to the one I described since 2014. I have more than 6 hubs worldwide for Internet, WAF, proxy and DNS services. I'll tell you it works. I've taken hubs down in the middle of the day for maintenance. No one noticed. I even have the security group that runs the security stack do updates to the security equipment during the day. In fact, instead of fighting the "now we have more devices to take care of" battle, they are actually some of the largest advocates of this design.

Hopefully I've at least given you something new to think about.