The Men & Mice Blog

Network Outages, Human Error and What You Can Do About It

Posted by Men & Mice on 12/18/17 7:14 PM

When your route leaks 

Human error. As far as mainstream reporting on network outages goes, it’s the less flamboyant sidekick to DDoS and other cyber attacks. But in terms of consequences, it’s just as effective.

Once again, at the beginning of November, large parts of the US found themselves unable to access the internet due to one small error: a misconfiguration at Level 3, an ISP (Internet Service Provider) that underpins other, bigger networks.

According to reports, the outage was the result of what is known as a “route leak”. In short, a route leak occurs when internet traffic is routed in inefficient, or simply wrong, directions due to incorrect information provided by one, or multiple, Autonomous Systems (ASes). ASes are generally used by ISPs to keep track of IP addresses and their network locations. Packets of data are routed between ASes, which use the Border Gateway Protocol (BGP) to establish and communicate the most efficient routes so you can browse the whole internet, and not just the IP addresses on your particular ISP’s network.

Route leaks can be malicious, in which case they’re referred to as “route hijacks” or “BGP hijacks”. But in this case, it seems the cause of the outage was nothing more spectacular than a simple employee blunder: as speculation goes, a Level 3/CenturyLink engineer, while trying to configure BGP for an individual customer, made a policy change that was, in error, applied to a single router. This particular incident constitutes what the IETF defines as a Type 6 route leak, which generally occurs when “an offending AS simply leaks its internal prefixes to one or more of its transit-provider ASes and/or ISP peers.”

Route leaks, small and large, are regular occurrences – they’re part and parcel of the internet’s dependency on the basic BGP routing protocol, which is known to be insecure. Other recent high-impact route leaks include the so-called Google/Hathway leak in March 2015 and a misconfiguration at Telekom Malaysia in June 2015 which had a debilitating knock-on effect around the world.

To minimize the possibility of route leaks, ISPs use route filters that are supposed to catch any problems with the IP routes that peers and customers announce for sending and receiving packets of data.
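To illustrate the idea behind such a filter, here is a minimal, hypothetical sketch in Python: it checks the prefixes a peer announces against a pre-agreed allow-list (the kind of data an ISP might build from an Internet Routing Registry) and flags anything outside the expected ranges. The peer name and prefix data are made-up placeholders, not real routing policy.

```python
import ipaddress

# Hypothetical allow-list: prefixes this peer has registered and agreed to announce.
ALLOWED_PREFIXES = {
    "peer-as64500": [ipaddress.ip_network("203.0.113.0/24"),
                     ipaddress.ip_network("198.51.100.0/22")],
}

def filter_announcements(peer, announced):
    """Split a peer's announcements into accepted and rejected prefixes."""
    allowed = ALLOWED_PREFIXES.get(peer, [])
    accepted, rejected = [], []
    for prefix in announced:
        net = ipaddress.ip_network(prefix)
        # Accept only prefixes that fall within (or equal) an allowed range.
        if any(net.subnet_of(allowed_net) for allowed_net in allowed):
            accepted.append(prefix)
        else:
            rejected.append(prefix)   # a likely route leak
    return accepted, rejected

accepted, rejected = filter_announcements(
    "peer-as64500",
    ["203.0.113.0/24", "192.0.2.0/24"],   # the second prefix was never agreed on
)
print("accepted:", accepted)
print("rejected (possible leak):", rejected)
```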

Other ways of combating route leaks include origin validation, NTT’s peer locking and commercial solutions. Additionally, the IETF is in the process of drafting proposals on route leaks.

Factoring in the human element

Tools and solutions aside, Level 3’s unfortunate misconfiguration once again highlights the fact that, despite keeping a low profile in the news, human error still rules when it comes to causing common network outages.

In an industry focused on designing, building and maintaining machines and systems that enable interconnected entities to send and receive millions of packets of data efficiently every second of every day, it’s perhaps not all that odd that the humans behind all of this activity become of secondary importance. Yet as technology advances and systems become more automated, small human errors such as misconfiguring a route prefix are likely to have ever larger knock-on effects. Increasingly, such incidents will roll out like digital tsunamis across oceans, instead of only flooding a couple of small, inflatable IP pools in your backyard.

Boost IT best practices - focus on humans

So outside of general IT best practices, what can you do to help the humans on your team avoid human error?

Just as with any network, human interaction is based on established relationships. And just as in any network, a weak link, or a breakdown in the lines of communication, can lead to an outage. Humans who have to operate in an atmosphere of unclear instructions, tasks, responsibilities and communication can become ineffective and anxious. This eats away at employee morale and workflow efficiency, and lays the groundwork for institutional inertia and stalled progress. At other times, a lack of defined tasks and clear boundaries may result in employees showing initiative in the wrong places and at the wrong times.

To limit outages due to human error, just distributing a general set of best practices or relying on informally communicated guidelines amongst staff is simply not enough. While networking best practices always apply, the following four steps can be very effective in establishing the kind of human relationships needed to strengthen your network and optimize network availability.

 


1. Define

Draw up, and keep updated, not only a diagram of your network architecture (you do have one, don’t you?), but also a workflow diagram for your teams: who is tasked with which responsibility, and where does their action fit into the overall process? What are the expected outcomes? And what alternative plans and processes are in place if something goes awry? Most importantly, match tasks and responsibilities with well-defined role-based access management.

2. Communicate

Does everyone on your team, and collaborating teams, know who is responsible for what, when and where, and how the processes flow? Is this information centrally accessible and kept up to date? Clarity, structure and effective communication empower your team members to accept responsibility and show initiative within bounds.

3. Train

Does everyone on your team know what’s expected of them, and did they receive appropriate training to complete their assignments properly and responsibly? Do they have the resources they need to do their work efficiently? Without training and tools in place, accidents are simply much more likely to occur.

4. Refresh

Don’t wait until team members run into trouble or run out of steam. Check in with each other regularly, and encourage a culture of knowledge sharing where individuals with different skill sets can have ample opportunity to develop new skills and understanding.


Finally

The saying goes, a chain is only as strong as its weakest link. The same goes for networks.

At a time in history when we have more technological checks and balances available than ever before, it turns out the weakest networking link is, too often, a human. While we’re running systems for humans by humans, we may as well put in the extra effort to help humans do what they do, better. Our networking systems will be so much stronger for it.

 


 

Topics: DDI, DDoS, network outages, IT best practices, IP address management

Secure Your DNS Across Multiple DNS Service Platforms with Men & Mice xDNS Redundancy

Posted by Men & Mice on 7/10/17 12:50 PM

DNS (Domain Name System) is the most critical aspect of any network’s availability. When DNS services are halted, or slowed down significantly, networks become inaccessible, leading to damaging losses in revenue and reputation for enterprises.

To ensure optimal network availability, many enterprises depend on top-tier managed DNS service providers for their external DNS needs. The basic “table stakes” characteristics of an enterprise-class managed DNS service are high reliability, high availability, high performance and traffic management. However, even the most robust DNS infrastructure is not immune to outages.

Outages may be localized, in which case only certain DNS servers in the network stop responding, or, less commonly, system-wide. A system-wide DNS failure can take an entire business offline – the equivalent of a power failure in every one of its data centers.

To prevent this, top-tier managed DNS systems have a great deal of built-in redundancy and fault tolerance, yet the danger of a single point of failure remains for enterprises that rely solely on a single-source DNS service.

If no DNS system is failure-proof, this raises the question: what should an enterprise do about it?

Using multiple DNS service providers for ultimate DNS redundancy

DNS availability statistics for managed DNS providers show that the industry norm exceeds five nines (99.999%) uptime – the equivalent of about 5 minutes of downtime per year. However, this top-line number does not provide any detail on the impact of degraded performance, or the cascading effect of a system-wide outage of varying duration, on individual enterprises.

To discover the true impact of a potential loss of DNS availability, enterprises need to properly assess the business risk associated with relying on a sole source provider, and compare that with the cost of a second source DNS service. What would a 30-minute loss of DNS cost the business in terms of revenue loss, reputation damage, support costs and recovery? What does it cost to maintain a second source DNS service?

Research amongst enterprises for whom online services are mission critical generally concludes that the cost ratios are in the range of 10:1 – one order of magnitude. Put another way, the cost of one outage is roughly estimated to be ten times the annual cost of maintaining a second service. A business would have to maintain a second-source DNS service for ten years for its cost to equal that of one major DNS outage.
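As a rough back-of-the-envelope illustration of that ratio, the break-even calculation can be sketched in a few lines of Python. The cost figures below are placeholders, not real pricing; substitute your own estimates.

```python
# Illustrative numbers only; substitute your own estimates.
uptime = 0.99999                         # "five nines"
downtime_minutes_per_year = (1 - uptime) * 365 * 24 * 60
print(f"Expected downtime: ~{downtime_minutes_per_year:.1f} minutes/year")

annual_second_dns_cost = 50_000          # hypothetical cost of a second provider
estimated_outage_cost = 500_000          # hypothetical cost of one major outage

# The ~10:1 ratio cited above: one outage costs roughly ten years of redundancy.
years_to_break_even = estimated_outage_cost / annual_second_dns_cost
print(f"One major outage ≈ {years_to_break_even:.0f} years of second-source DNS")
```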

Looking at the odds and costs of outages, many enterprises are opting to bring in a second, or even a third, DNS service to hold copies of critical DNS master zones.

This system of external DNS redundancy boosts DNS availability by:


1. removing the danger of exposure to a single point of DNS failure.

2. reducing traditional master-slave DNS redundancy vulnerabilities, where slave zones can’t be changed if the master becomes unavailable.

3. improving infrastructure resilience by hosting critical zones with multiple providers, ensuring continued service availability and updates of changes if one DNS service provider becomes unavailable.

The risky business of maintaining DNS redundancy across platforms

In theory, DNS redundancy across multiple DNS service provider platforms should be the best solution for optimal DNS high reliability, high availability and high performance. In practice, however, the complexity of tasks and scope for error involved in replicating and maintaining identical DNS zones on multiple platforms pose additional threats to DNS availability. The situation is made worse by:

  • A lack of centralized views
  • A lack of workflow automation
  • The difficulty of coordinating multiple platform APIs

This inability to view, synchronize and update identical zones’ data simultaneously can, in itself, lead to errors and conflicts in DNS configuration and result in a degradation of network performance, or even a network outage – the very events that multi-provider DNS redundancy is intended to prevent.

Protect your DNS on multiple platforms with Men & Mice xDNS Redundancy

Breaking new ground in the battle against DNS disruption, the Men & Mice xDNS Redundancy feature provides the abstraction level necessary to replicate and synchronize critical DNS master zones across multiple DNS service provider platforms, on-premises, in the cloud, or in hybrid or multi-cloud environments.

Men & Mice xDNS provides a unified view and centralized management of DNS data, regardless of the DNS service provider platform. Network administrators and other authorized users can use xDNS to perform necessary updates to their network’s DNS, as well as benefit from building automation with the powerful Men & Mice API, instead of having to dig around in different DNS platforms and deal with coordinating conflicting APIs.
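As a purely illustrative sketch of what such automation could look like – the endpoint path, field names and credentials below are hypothetical placeholders, not the documented Men & Mice API – a script might add a record once against the central management layer and let that layer propagate the change to every provider hosting the zone:

```python
import requests

BASE_URL = "https://mmsuite.example.com/api"   # hypothetical endpoint
SESSION = requests.Session()
SESSION.auth = ("automation-user", "s3cret")   # placeholder credentials

def add_record(zone, name, rtype, data, ttl=300):
    """Add one DNS record via a single (hypothetical) management API call.

    The management layer, not this script, is responsible for pushing the
    change to every provider hosting the zone (Route 53, Azure DNS, NS1, ...).
    """
    payload = {"zone": zone, "name": name, "type": rtype,
               "data": data, "ttl": ttl}
    resp = SESSION.post(f"{BASE_URL}/dnsrecords", json=payload, timeout=10)
    resp.raise_for_status()
    return resp.json()

# One call, instead of one call per provider-specific API.
add_record("example.com.", "www", "A", "203.0.113.10")
```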

Combined with the flexibility of building automation on top of the Men & Mice Suite, xDNS gives you the freedom to distribute your DNS load based on zone priority, performance requirements and accompanying costs. With xDNS, you are better equipped to take advantage of tiered pricing for externally hosted zones – placing, for example, critical high-performance zones and less essential low-performance zones with whichever DNS service is best suited to each at a given time.

 


How xDNS Redundancy Works

Using the Men & Mice xDNS feature, create a zone redundancy group by selecting critical zones from DNS servers and services such as BIND, Windows DNS, Azure DNS, Amazon Route 53, NS1, Dyn and Akamai Fast DNS.

Once an xDNS zone redundancy group has been created, xDNS assists the administrator in creating identically replicated zone content, resulting in multiple identical master zones. Zones can be added to, or removed from, the xDNS group as required.

All changes initiated by the user through Men & Mice, whether via the UI or the API, are applied to all zone instances in the group. All changes made externally to zones in an xDNS group are synchronized to all zones in that group. However, if DNS record conflicts arise, xDNS alerts the user and provides options for resolving the conflicts before the group is re-synchronized.

If an xDNS zone is not available for updating, for instance if one DNS service provider experiences an outage, that zone will be marked as out-of-sync. Once the zone becomes available again, it will be automatically re-synchronized and will receive all updates that were made while the DNS service was unavailable.
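A simple way to observe this kind of drift from the outside – independent of any management tool – is to compare the SOA serial that each provider’s authoritative name servers return for the zone, assuming the copies are maintained as identical master zones, as xDNS does. The sketch below uses the dnspython library; the zone name and name-server addresses are placeholders.

```python
import dns.resolver   # pip install dnspython

ZONE = "example.com."
# Placeholder authoritative servers, one per provider hosting the zone.
PROVIDERS = {
    "provider-a": "198.51.100.53",
    "provider-b": "203.0.113.53",
}

def soa_serial(server_ip, zone):
    """Ask one authoritative server directly for the zone's SOA serial."""
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [server_ip]
    answer = resolver.resolve(zone, "SOA")
    return answer[0].serial

serials = {name: soa_serial(ip, ZONE) for name, ip in PROVIDERS.items()}
print(serials)
if len(set(serials.values())) > 1:
    print("Zone copies are out of sync - time to re-synchronize.")
```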

Men & Mice and NS1

NS1, the leading intelligent DNS and traffic management provider, recognizes the growing need for diverse application resiliency. NS1 has joined forces with Men & Mice in improving the efficacy of external DNS redundancy. Kris Beevers, Co-founder and CEO, says:

"Leveraging multiple managed DNS networks is the clear best practice for maintaining 100% uptime in today's rapidly evolving operational environment.  Configuring and operating multiple managed DNS services can be a complex, time-consuming process.  NS1 is excited to partner with Men & Mice to help enterprises minimize management overhead and seamlessly enable redundant DNS. xDNS Redundancy is well-suited to enable multi-network DNS without the usual headaches."

Men & Mice xDNS – making external DNS redundancy truly resilient

DNS redundancy is a great concept on paper, but a daunting challenge in practice. With xDNS, enterprises can seek out second, or even third source DNS services, confident in the knowledge that their DNS, and ultimately their business, will truly be safer that way.

Magnus Bjornsson, Men & Mice CEO, considers xDNS an important step towards providing enterprises with greater, and more reliable, network availability.
“Recent prominent network outages once again illustrate the critical importance of building more effective network resiliency through a powerful and secure system of DNS redundancy. Men & Mice xDNS provides a simple way for companies to manage their DNS on multiple external platforms, with the Men & Mice Suite software automatically taking care of the replication and synchronization of data in a reliable and consistent manner. We are looking forward to cooperating with NS1 on developing xDNS and extending DNS redundancy offerings.”

Men & Mice xDNS takes the ‘daunt’ out of maintaining external DNS redundancy, providing the centralized views and control necessary to reduce the risk of network exposure to a single point of failure, improve network reliability and performance and bolster the successful mitigation of DDoS attacks and other potentially harmful DNS incidents.

To learn more about xDNS Redundancy, check out the xDNS webinar, jointly presented by Men & Mice and NS1.

Check out the video to discover how it all comes together with DDI:

Or try it out in the Men & Mice Suite:


Topics: DNS, Security, High availability, DNS redundancy, DDoS, External DNS, Failover