Building a highly available enterprise network

April 2, 2010
The right network technology enables IT managers to build the highly available communication infrastructure they need for non-stop operations, while at the same time lowering capital and operational expenses.

By Scott Calzia and Lenny Bonsall

Overview

The high cost of downtime and an increased reliance on network infrastructures are driving the availability requirements for enterprise networks. Business processes need to operate around the clock with increasingly distributed applications to support employees, partners, and customers spread across the globe.

At the same time, the enterprise network has evolved from data-only transport to a converged multiservice freeway carrying a mix of data, voice, and video as well as traffic from disparate networks, such as security scanners and other building automation systems. With the widespread adoption of IP telephony, IT is expected to deliver the same level of availability for this converged enterprise network that users have come to expect from traditional PBX systems.

The consequences of network downtime are numerous and costly: immediate and potentially significant revenue loss; damage to the company’s image or reputation; and drops in productivity when employees can’t access email, phones, or critical business applications. Each of these has the potential to adversely impact business operations and drive customers away.

Network technology can advance the economics of networking by enabling companies to build the high-performance, highly available enterprise networks needed for non-stop operations. Businesses expect the network to be available 24/7, and many enterprise IT departments are tasked with meeting “five nines” (99.999%) uptime requirements, which allow fewer than six minutes of service interruption per year. The key to achieving this level of uptime is boosting device, network, and operational availability.
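To make the "five nines" target concrete, the short Python sketch below converts an availability percentage into a yearly downtime budget. The function name and the use of an average 365.25-day year are illustrative assumptions, not something the article specifies.

```python
# Minutes in an average year (365.25 days, an assumption that includes leap years).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime, in minutes, allowed by an availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return (1.0 - availability_percent / 100.0) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(target)
        print(f"{target}% uptime allows {minutes:.2f} minutes of downtime per year")
```

Under these assumptions, 99.999% works out to a little over five minutes per year, which is why each additional "nine" is so much harder to achieve than the last.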

The three aspects of availability

An enterprise network’s availability is a function of the reliability and maintainability of its components. To increase availability to meet demands for 99.999% uptime, reliability and/or maintainability must be increased. By how much is a business decision, often driven by cost.
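The article does not give a formula, but one common way to express this relationship is the steady-state ratio of mean time between failures (MTBF, a measure of reliability) to MTBF plus mean time to repair (MTTR, a measure of maintainability). The sketch below assumes that model; the function name and the hour figures are hypothetical.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability, assuming A = MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical figures: the same device with slower and faster repair times.
    baseline = availability(mtbf_hours=50_000, mttr_hours=4.0)
    faster_repair = availability(mtbf_hours=50_000, mttr_hours=0.5)
    more_reliable = availability(mtbf_hours=100_000, mttr_hours=4.0)
    print(f"baseline:      {baseline:.6f}")
    print(f"faster repair: {faster_repair:.6f}")
    print(f"more reliable: {more_reliable:.6f}")
```

Either lever raises availability: a longer interval between failures or a shorter repair time. Which one is cheaper to pull is the business decision described above.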

Clearly, IT will want to focus on those portions of the network that support the largest number of users and the most critical resources. For example, the loss of an access switch will affect a few dozen users, whereas the loss of a data center switch that supports servers running business-critical applications will affect the entire company and possibly its customers.

Once IT has determined the availability requirements for each portion of the infrastructure, the network must be designed and products selected with three aspects of availability in mind:

  1. Device availability
  2. Network availability
  3. Operational availability

Designing in device availability

Device availability encompasses reliability and maintainability features. At the device level, redundant components, including power supplies, fan trays, control modules, interface cards, and switch fabrics, eliminate the most common causes of hardware failure. To increase maintainability, these components should be both field-replaceable and hot-swappable, and failover from the failed component to the backup component should be automatic and seamless in highly critical portions of the network. In remote offices or branch locations lacking IT staff, field-replaceable components may increase maintainability by making it possible for untrained local staff to replace failed components without waiting for IT’s assistance.
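The effect of redundant components can be approximated with the usual parallel-availability formula. The sketch below assumes that component failures are independent and that failover is instantaneous and always succeeds; real devices will fall somewhat short of this ideal. The function name and the 99.9% figure are hypothetical.

```python
def redundant_availability(component_availability: float, copies: int) -> float:
    """Availability of a group that works as long as at least one copy works.

    Assumes independent failures and perfect, instantaneous failover.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("at least one component is required")
    # The group fails only when every copy has failed at the same time.
    return 1.0 - (1.0 - component_availability) ** copies


if __name__ == "__main__":
    single = redundant_availability(0.999, copies=1)
    dual = redundant_availability(0.999, copies=2)
    print(f"one power supply:  {single:.6f}")
    print(f"two power supplies: {dual:.6f}")
```

Under these assumptions, pairing two components that are each available 99.9% of the time yields a group that is unavailable only about one millionth of the time.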

A modular operating system in which protocols are compartmentalized is essential to stability because it provides functional separation of software components (Figure 1). Networking devices offer rich functionality that inherently increases reliance on the operating system software. In a modular operating system, a software malfunction in one protocol may be isolated to a single module, enabling the rest of the operating system to continue functioning with minimal network and user interruption. Likewise, if a problem is identified within a given module, the problem can be resolved and the module restarted gracefully without interrupting the rest of the operating system.

Conversely, in a monolithic, non-modular operating system without compartmentalization, such a software malfunction could cause a full operating system crash. This would require the entire OS to be restarted, significantly affecting availability.

Enterprise networks are also subject to known outages, such as for scheduled maintenance and upgrades. However, round-the-clock demands mean there are fewer off-peak traffic periods in which to schedule these outages, so IT must be able to perform maintenance and upgrades without disabling the network. To maintain maximum availability, network devices deployed in sensitive parts of the enterprise network should be capable of supporting in-service software upgrades, or the network must be designed with redundant devices so that upgrades are transparent to the user.

Boosting network availability

Network availability encompasses reliability and maintainability mechanisms and configurations. Fundamental to increasing network availability is simplifying the network architecture.

By collapsing multiple switching tiers in traditional architectures, a simplified network requires fewer devices and interconnections—leading to improved efficiencies in availability as well as space, power, cooling, and management (Figure 2). This reduction in equipment and improved performance with high availability are achieved by combining virtualized devices with line-rate performance.
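One way to see why fewer tiers helps availability is to treat the devices a packet must cross as a series chain, where the path is up only if every device is up. The sketch below assumes independent failures and uses a hypothetical per-device availability of 99.99%; the function name is illustrative.

```python
import math
from typing import Sequence


def series_availability(device_availabilities: Sequence[float]) -> float:
    """Availability of a path that needs every listed device to be working.

    Assumes device failures are independent of one another.
    """
    if not device_availabilities:
        raise ValueError("at least one device is required")
    return math.prod(device_availabilities)


if __name__ == "__main__":
    per_device = 0.9999  # hypothetical figure for each switching tier
    three_tier = series_availability([per_device] * 3)  # access, aggregation, core
    two_tier = series_availability([per_device] * 2)    # collapsed design
    print(f"three tiers: {three_tier:.6f}")
    print(f"two tiers:   {two_tier:.6f}")
```

Every device removed from the path removes one term from the product, so the collapsed design comes out ahead even before any savings in space, power, and cooling are counted.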

Redundant paths are used to interconnect various network devices not only to provide an increase in bandwidth, but to provide alternative connections in the event of a port, link, or device failure. Designing the enterprise network with virtualized devices enables such redundant, load-sharing interconnects between devices and reduces the number of logical devices in the network, which results in increased operational availability.

In conjunction with redundant paths, the right network protocols provide fast failover and recovery in the event of a primary link failure. Traditionally, an enterprise LAN architecture has used the Spanning Tree Protocol (STP) to provide active-standby redundant paths and to prevent loops in the network. However, STP can take 30 seconds or longer to resolve link failures and re-establish loop-free paths throughout the network.
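The "30 seconds or longer" figure follows from the protocol's timers. The sketch below assumes the default timers of classic IEEE 802.1D STP (a max age of 20 seconds and a forward delay of 15 seconds); the article does not state which timers or STP variant it has in mind, and rapid variants converge much faster. The function names are hypothetical.

```python
def direct_failure_convergence(forward_delay_s: int = 15) -> int:
    """Seconds for a port to pass through the listening and learning states."""
    return 2 * forward_delay_s


def indirect_failure_convergence(max_age_s: int = 20, forward_delay_s: int = 15) -> int:
    """Seconds to age out stale topology information, then listen and learn."""
    return max_age_s + 2 * forward_delay_s


if __name__ == "__main__":
    print(f"directly detected failure:   {direct_failure_convergence()} s")
    print(f"indirectly detected failure: {indirect_failure_convergence()} s")
```

With these defaults, a failure on a directly attached link takes about 30 seconds to recover from, and a failure elsewhere in the topology can take about 50 seconds.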

Equal-cost multipath protocols, on the other hand, are capable of providing redundant paths, sub-second failover, and full active-active use of available link capacity. Equal-cost multipath protocols, used with virtualized network devices, also improve reliability by providing redundant paths spread across redundant devices.
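The active-active behavior can be illustrated with a simple hash-based selection of next hops. This is a minimal sketch, not how any particular switch implements it: real devices use hardware hash functions, and the `Flow` type, the `pick_next_hop` function, and the device names are all hypothetical.

```python
import hashlib
from typing import NamedTuple, Sequence


class Flow(NamedTuple):
    """The five fields commonly used to identify a traffic flow."""
    src_ip: str
    dst_ip: str
    protocol: int
    src_port: int
    dst_port: int


def pick_next_hop(flow: Flow, next_hops: Sequence[str]) -> str:
    """Map a flow onto one of several equal-cost next hops.

    Hashing the flow keeps all packets of one flow on the same path,
    while different flows spread across every available path.
    """
    if not next_hops:
        raise ValueError("no next hops available")
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]


if __name__ == "__main__":
    paths = ["core-1", "core-2"]
    flow = Flow("10.0.0.5", "10.1.0.9", 6, 49152, 443)
    print("both links up:", pick_next_hop(flow, paths))
    # After a failure, the flow is simply rehashed over the surviving paths.
    print("core-1 failed:", pick_next_hop(flow, ["core-2"]))
```

Because every path carries traffic all the time, a failure removes capacity but does not require a standby link to be brought up first.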

Virtualized devices improve availability in other ways. While historically it made economic sense to use Layer 2 protocols at the access layer of the enterprise network, network designs today are less complex—and therefore more available—if virtualized devices are employed to create a single, routed control plane that operates from access layer uplinks to the aggregation and core layers. Such an architecture eliminates the need for STP.

Additionally, a routed infrastructure supports more deterministic traffic flows, allowing IT to proactively identify potential problems and therefore increasing maintainability. A routed infrastructure also reduces the size of failure domains, so a failure affects fewer users and therefore further increases network availability.

IT also needs to select network devices that ensure consistent throughput and traffic control across the entire network. This rule is especially true in the enterprise network core, where a device overwhelmed with traffic may propagate congestion problems throughout the network. Such a situation can result in traffic loss, slow response times, and the inability to prioritize business-critical applications.

Finally, one of the most important aspects of availability is the ability to protect the network from misuse. Access control features enable IT to strictly control who can join the network and what they can access. Such features also enable the enforcement of threat-management policies, such as anti-virus and software patches, to prevent unintended problems from infiltrating the enterprise network.

Operational availability—simplifying operations

Only a small percentage of network downtime is caused by hardware or software failures. The vast majority—between 60 and 80 percent—is the result of unexpected events caused by human error (Figure 3). Research shows that misconfigurations, unauthorized changes, and operator errors are the most common causes of unplanned network downtime. IT needs a highly available network infrastructure that not only minimizes hardware and software faults, but mitigates or even prevents the impact of human error and provides an audit trail to learn from these incidents.

Given that human error is the leading cause of network downtime, enterprises have the most to gain from operational availability, which equates to simplifying and automating routine operations and maintenance. IT can simplify operations by selecting products with features, processes, and tools that reduce complexity and automate tasks. IT can also reduce network complexity by using standards-based technologies and products, as well as simplifying the architecture and eliminating network tiers.

In addition, having the same operating system software across all network infrastructure and security platforms makes it easier to roll out new features and new versions of software. Support for standards reduces compatibility problems and boosts interoperability among different vendors’ devices.

Sophisticated operating systems include flexible scripting technologies that run on networked devices to avert configuration errors and accelerate problem identification and resolution. Using these intelligent and customizable scripts, IT can extend their expertise across the network infrastructure by programming the operating system to automatically simplify and validate configuration changes to prevent common operations mistakes or other human errors. If a problem does result from a configuration change, the operating system should include a rollback feature that enables the change to be undone and the configuration returned to an earlier, problem-free version.
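The validate-then-commit pattern with rollback can be sketched in a few lines of Python. This is an illustration of the idea only and not the API of any network operating system; the `ConfigStore` class, the validator, and the configuration keys are all hypothetical.

```python
from typing import Callable, Dict, List, Optional

Config = Dict[str, str]
# A validator returns an error message, or None if the candidate is acceptable.
Validator = Callable[[Config], Optional[str]]


class ConfigStore:
    """Keeps a history of committed configurations so changes can be undone."""

    def __init__(self, initial: Config) -> None:
        self._history: List[Config] = [dict(initial)]

    @property
    def active(self) -> Config:
        return dict(self._history[-1])

    def commit(self, candidate: Config, validators: List[Validator]) -> List[str]:
        """Apply the candidate only if every validator accepts it."""
        errors: List[str] = []
        for check in validators:
            message = check(candidate)
            if message is not None:
                errors.append(message)
        if not errors:
            self._history.append(dict(candidate))
        return errors

    def rollback(self, steps: int = 1) -> Config:
        """Return to an earlier committed configuration."""
        if steps < 1 or steps >= len(self._history):
            raise ValueError("not enough history to roll back that far")
        del self._history[-steps:]
        return self.active


def require_uplink_description(config: Config) -> Optional[str]:
    """Hypothetical site rule: every uplink must carry a description."""
    if not config.get("uplink.description"):
        return "uplink.description must not be empty"
    return None
```

A candidate that fails a check never becomes active, and a committed change that later proves harmful can be undone with `rollback()`. The history also doubles as a simple audit trail of what was changed.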

Not only do such scripts automatically detect and diagnose performance problems, they also provide a continuous improvement capability that allows engineers to identify problems and then proactively write new scripts designed to avoid such problems in the future.

To build on these scripting capabilities, a sophisticated operating system also offers a comprehensive set of built-in tools and technologies designed to provide technical staff with the automated delivery of tailored, proactive network intelligence and support services. By integrating advanced support intelligence into networking platforms, automating support steps, and providing proactive insight into platform operations, such an operating system increases network availability and lowers operational costs.

This disciplined approach not only limits the impact of human error, it also dramatically reduces IT configuration, operations, and management overhead. IT can configure and manage each feature the same way with the same effect throughout the network and use the same tools to monitor, manage, and update multiple devices. As a result, many configuration and maintenance tasks are automated, which reduces downtime caused by human error.

Conclusion

Network technology can advance the economics of networking, enabling IT to build the highly available communication infrastructure they need for non-stop operations, while at the same time lowering capital and operational expenses. By combining redundancy and resiliency features in a variety of form factors at competitive price points, these technologies can give IT tremendous flexibility, making it possible for enterprises to build high availability into every part of the network. As a result, the enterprise benefits from predictable network behavior and improved uptime, while IT benefits from simplified operations.

Scott Calzia is a senior marketing manager and Lenny Bonsall a product marketing manager with Juniper Networks’ Fabric & Switching Technologies Business Group.

Links to more information

Lightwave: Roles Different Protocols Will Play in the Cloud
Lightwave: The Growth of Fibre Channel over Ethernet
Lightwave: FTTE Battles for Enterprise/SAN Acceptance
