Architecture and design of function-specific, wire-speed routers for optical internetworking
Figure 1. A look at Internet backbone traffic patterns reveals that packet sizes follow a trimodal distribution, which has significant implications for router design.
Future traffic can be extrapolated by combining the forecasted increase in traffic volume with the trimodal packet-size distribution. In the simplest case, we can expect an overall increase in the volume of acknowledgement traffic (minimum-sized packets). In the more realistic case, we must overlay best-effort and mission-critical traffic along with streaming services, and consequently need to add a third dimension to the traffic mix: the type of service.
Meanwhile, with the move toward a converged IP-based infrastructure, carriers require intelligent devices at network aggregation points. With delay-sensitive streaming content being injected into the network, line-rate performance is a non-negotiable criterion. Additionally, requirements for ultra-high connection densities, as well as for highly efficient, low-power, space-conscious equipment, have led to a new metric for qualifying carrier gear: routed gigabits per (power unit × rack space unit × $).
To grasp the interplay of these trends with system design, some fundamental concepts must be understood. For example, wire speed (or line rate) refers to the maximum rate at which a physical medium can sustain information transfer. The key variables that determine wire speed are the number of bits per second that the physical medium can transport and the size of the minimum quantum of information (packet or cell). Thus, an OC-48c link, whose line rate is 2.488 Gbits/sec, carrying 40-byte (320-bit) packets with no interpacket gap or overhead bits corresponds to one packet arriving every 129 nsec (320 bits ÷ 2.488 Gbits/sec ≈ 128.6 nsec).
Wire-speed or line-rate processing requires that operations be performed on a per-packet basis at the maximum packet arrival rate (every 129 nsec in the case above).
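The 129-nsec figure comes directly from dividing packet size by line rate. Here is a minimal Python sketch. Like the text, it assumes back-to-back minimum-sized packets with no framing overhead. The constant and function names are illustrative only.

```python
# Minimum spacing between back-to-back packets on a saturated link.
# Assumes no interpacket gap or framing overhead, as in the text; real links
# add overhead, so actual spacing is at least this large.

OC48_LINE_RATE_BPS = 2_488_320_000  # OC-48 line rate: 2488.32 Mbits/sec


def packet_interval_ns(line_rate_bps: float, packet_bytes: int) -> float:
    """Nanoseconds between consecutive packets of packet_bytes at line_rate_bps."""
    return packet_bytes * 8 / line_rate_bps * 1e9


def packet_rate_pps(line_rate_bps: float, packet_bytes: int) -> float:
    """Packets per second the forwarding path must sustain to keep up."""
    return line_rate_bps / (packet_bytes * 8)


if __name__ == "__main__":
    t = packet_interval_ns(OC48_LINE_RATE_BPS, 40)
    r = packet_rate_pps(OC48_LINE_RATE_BPS, 40)
    print(f"OC-48c, 40-byte packets: one every {t:.1f} ns ({r / 1e6:.2f} Mpackets/sec)")
```

Running it gives an interval of about 128.6 nsec, which the text rounds to 129 nsec, and a rate of close to 7.8 million packets/sec. Larger packets relax the budget: 1,500-byte packets at the same rate arrive roughly every 4.8 µsec.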
Guaranteeing line-rate packet processing and forwarding performance has numerous positive side effects in the QoS domain.
Meanwhile, the concept of routing focuses on using network-layer information to forward packets. The basic network-layer functions (OSI Layers 3 and 4) consist of the following:
- Route processing. Where is the packet destined to arrive?
- Flow processing. Stateful information that categorizes a packet or group of packets that belong to an information session.
The action of determining the destination of a packet based on data embedded within it is termed route processing. IPv4 networks use classless interdomain routing (CIDR), which the Internet Engineering Task Force (IETF) introduced in 1993 to optimize the use of available address space. The basic principles of CIDR involve the segmentation of the Internet into a hierarchical, logically addressable group of subnetworks. Consequently, each router needs to keep track only of the paths that are directly accessible via its network interfaces. CIDR's logical addressing scheme requires a "longest network prefix match" operation, in which the prefix length is defined by a mask applied to the 32-bit IPv4 address. CIDR route lookups are therefore not direct (exact-match) table lookups, and they become quite complex with large tables.
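The longest-prefix-match rule is easier to see in code. The reference version below tries every route, applies that route's mask, and keeps the longest match. It is a minimal sketch for illustration only, and the route table and names are hypothetical. Its cost grows linearly with table size, which is why wire-speed routers need smarter structures.

```python
import ipaddress

# Hypothetical route table: (prefix, next hop). Prefixes deliberately overlap.
ROUTES = [
    ("0.0.0.0/0", "default-gw"),
    ("10.0.0.0/8", "if0"),
    ("10.1.0.0/16", "if1"),
    ("10.1.2.0/24", "if2"),
]


def prefix_mask(length: int) -> int:
    """32-bit mask with the `length` high-order bits set (length 0 gives 0)."""
    return (0xFFFFFFFF << (32 - length)) & 0xFFFFFFFF


def lpm_linear(addr: str, routes=ROUTES):
    """Reference longest-prefix match: check every route, keep the longest hit."""
    a = int(ipaddress.IPv4Address(addr))
    best_len, best_hop = -1, None
    for prefix, hop in routes:
        net_str, len_str = prefix.split("/")
        length = int(len_str)
        net = int(ipaddress.IPv4Address(net_str))
        mask = prefix_mask(length)
        # The route matches if the masked address equals the masked network.
        if (a & mask) == (net & mask) and length > best_len:
            best_len, best_hop = length, hop
    return best_hop


print(lpm_linear("10.1.2.3"))     # → if2
print(lpm_linear("10.1.9.9"))     # → if1
print(lpm_linear("192.168.1.1"))  # → default-gw
```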
The complexity of a CIDR route lookup changes dramatically with the total number of routes in a route table. The nested nature of the addressing scheme causes lookup time to grow logarithmically with table size. Wire-speed algorithmic CIDR route lookup is nontrivial, since it involves translating an algorithm into hardware (such as an application-specific integrated circuit) and ensuring that it provides deterministic convergence under worst-case traffic conditions. A second challenge is to keep the jitter (i.e., the variation in algorithm convergence time) bounded so as to limit latency within the network.
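A widely taught structure for longest prefix match is the binary (unibit) trie, sketched below in Python. It is a stand-in chosen for illustration, not the lookup algorithm of the design described here. Its lookup cost does not grow with table size: the 32-bit address length caps it. That gives the kind of deterministic worst case the paragraph above calls for. Hardware implementations usually consume several bits per step (multibit tries) to cut memory accesses further. All names are hypothetical.

```python
import ipaddress


class TrieNode:
    __slots__ = ("children", "next_hop")

    def __init__(self):
        self.children = [None, None]  # subtries for next address bit 0 and 1
        self.next_hop = None          # set when a prefix ends at this node


class BinaryTrie:
    """Unibit trie for IPv4 longest-prefix match (illustrative, not optimized)."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, prefix: str, next_hop: str) -> None:
        net = ipaddress.IPv4Network(prefix)
        addr = int(net.network_address)
        node = self.root
        # Walk one node per prefix bit, most significant bit first.
        for i in range(net.prefixlen):
            bit = (addr >> (31 - i)) & 1
            if node.children[bit] is None:
                node.children[bit] = TrieNode()
            node = node.children[bit]
        node.next_hop = next_hop

    def lookup(self, addr: str):
        """Return (next_hop, nodes_visited); nodes_visited is at most 33."""
        a = int(ipaddress.IPv4Address(addr))
        node = self.root
        best, visited = node.next_hop, 1
        for i in range(32):
            node = node.children[(a >> (31 - i)) & 1]
            if node is None:
                break  # no longer prefix exists below this point
            visited += 1
            if node.next_hop is not None:
                best = node.next_hop  # remember the longest match so far
        return best, visited


trie = BinaryTrie()
for prefix, hop in [("0.0.0.0/0", "default-gw"), ("10.0.0.0/8", "if0"),
                    ("10.1.0.0/16", "if1"), ("10.1.2.0/24", "if2")]:
    trie.insert(prefix, hop)
print(trie.lookup("10.1.2.3"))  # → ('if2', 25)
```

The worst case is the root plus one node per address bit. That ceiling holds no matter how many routes the table contains.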
Packet classification is the key element of flow processing. Packets may be classified based on a parameterized set of metrics that may involve multifield packet header analysis. The parameters are usually specified by a user in conjunction with resource information that may be derived from routing protocols.
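Multifield classification can be sketched as an ordered list of rules. Each rule matches several header fields using wildcards, prefixes, or ranges, and the first matching rule wins. The rule set, field choices, and class names below are hypothetical. A linear scan like this is only a reference model; line-rate classifiers typically rely on TCAMs or specialized algorithms.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    proto: int   # IP protocol number: 6 = TCP, 17 = UDP
    sport: int
    dport: int
    dscp: int


@dataclass(frozen=True)
class Rule:
    name: str
    src: str = "0.0.0.0/0"               # source prefix
    dst: str = "0.0.0.0/0"               # destination prefix
    proto: Optional[int] = None          # None matches any protocol
    dport: Tuple[int, int] = (0, 65535)  # inclusive destination-port range
    dscp: Optional[int] = None           # None matches any code point

    def matches(self, p: Packet) -> bool:
        return (ipaddress.IPv4Address(p.src) in ipaddress.IPv4Network(self.src)
                and ipaddress.IPv4Address(p.dst) in ipaddress.IPv4Network(self.dst)
                and self.proto in (None, p.proto)
                and self.dport[0] <= p.dport <= self.dport[1]
                and self.dscp in (None, p.dscp))


# Hypothetical rule set, highest priority first.
RULES = [
    Rule("voice", proto=17, dscp=46),
    Rule("web-server", dst="192.0.2.10/32", proto=6, dport=(80, 80)),
]


def classify(p: Packet, rules=RULES) -> str:
    """First-match multifield classification; unmatched traffic is best effort."""
    for rule in rules:
        if rule.matches(p):
            return rule.name
    return "best-effort"
```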
Flows in connectionless networks are determined by grouping packets that have common application-layer or session-layer information. A flow can be based on information transacted between particular source and destination IP addresses or a TCP/UDP socket. Flows can also be based on DiffServ code points or type-of-service (ToS) bits. Fundamentally, a flow is a set of like packets, classified on the basis of information contained within each of them.
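The flow definitions above differ only in which header fields form the grouping key. The sketch below groups packets into flows at either socket (5-tuple) or host-pair granularity and keeps per-flow packet and byte counts. The packet records are hypothetical.

```python
from collections import defaultdict

# Hypothetical packet records: (src, dst, proto, sport, dport, length_bytes)
PACKETS = [
    ("10.0.0.1", "10.0.9.9", 6, 5000, 80, 1500),
    ("10.0.0.1", "10.0.9.9", 6, 5000, 80, 40),
    ("10.0.0.1", "10.0.9.9", 6, 5001, 80, 576),
    ("10.0.0.2", "10.0.9.9", 17, 6000, 5004, 200),
]


def flow_key(pkt, granularity="socket"):
    """Choose which header fields define a flow."""
    src, dst, proto, sport, dport, _length = pkt
    if granularity == "socket":      # TCP/UDP 5-tuple
        return (src, dst, proto, sport, dport)
    if granularity == "host-pair":   # source and destination IP addresses only
        return (src, dst)
    raise ValueError(f"unknown granularity: {granularity}")


def group_flows(packets, granularity="socket"):
    """Map each flow key to [packet_count, byte_count]."""
    flows = defaultdict(lambda: [0, 0])
    for pkt in packets:
        stats = flows[flow_key(pkt, granularity)]
        stats[0] += 1
        stats[1] += pkt[-1]
    return dict(flows)
```

The three TCP packets form two socket-level flows but only one host-pair flow. The choice of key therefore sets the flow granularity.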
DiffServ, or "Differentiated Services," deserves further explanation. DiffServ results from IETF initiatives to specify a means of providing end-to-end QoS in a connectionless packet-based network. The IPv4 packet header contains a type-of-service (ToS) byte, which DiffServ redefines as the DS field. Its six high-order bits carry a DiffServ code point (DSCP), allowing up to 64 code points for marking packets to denote various levels of service. The top three of those bits remain compatible with the older IP precedence values. These DiffServ labels may be generated by source nodes in the network and may be altered by intermediate routers to shape network traffic. DiffServ is meant to provide a granular means of differentiating classes of service at the network edge.
As mentioned earlier, a flow can be identified by various parameters, including DiffServ labels and application (TCP/UDP) information. Fine-grained flows identified at the network edge are termed microflows. An example of a microflow would be all packets of a certain TCP/UDP socket that originate from a particular IP address, or all RTP traffic destined for a certain IP address. Once a packet has been classified at the network edge and identified with a particular flow, it is forwarded out of the routing device onto the next level of aggregation within the network.
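Per RFC 2474, the DiffServ code point (DSCP) occupies the six high-order bits of the former IPv4 ToS byte, and RFC 3168 later assigned the two low-order bits to ECN. The sketch below extracts those fields. It also shows how an intermediate router might re-mark a packet without disturbing the ECN bits. The function names are hypothetical.

```python
def parse_ds_field(tos_byte: int) -> dict:
    """Split the IPv4 ToS/DS byte into DSCP, class-selector, and ECN bits."""
    if not 0 <= tos_byte <= 0xFF:
        raise ValueError("the ToS/DS field is one byte")
    dscp = tos_byte >> 2                  # six high-order bits (RFC 2474)
    return {
        "dscp": dscp,
        "class_selector": dscp >> 3,      # top three bits, compatible with IP precedence
        "ecn": tos_byte & 0x03,           # two low-order bits (RFC 3168)
    }


def remark(tos_byte: int, new_dscp: int) -> int:
    """Rewrite the DSCP, leaving the ECN bits unchanged."""
    if not 0 <= new_dscp <= 0x3F:
        raise ValueError("a DSCP is a 6-bit value")
    return (new_dscp << 2) | (tos_byte & 0x03)


print(parse_ds_field(0xB8))  # → {'dscp': 46, 'class_selector': 5, 'ecn': 0}
```

A ToS byte of 0xB8, for example, decodes to DSCP 46 (Expedited Forwarding) with class selector 5.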
Figure 2 shows a multi-edge Internet model that illustrates the various levels of aggregation occurring at different points in the network and the rough route and flow metrics at these points. It is important to recognize the inefficiency of multiple examinations of the same flow of packets at various aggregation points. In fact, as we approach the backbone, data pipes get larger and packet arrival rates increase, making it impractical to perform deep-packet examination within the core. Additionally, the core may be operating on a different link-layer protocol such as Asynchronous Transfer Mode (ATM).
Enter macroflows. A macroflow consists of a logical grouping of similar microflows. For instance, all packets entering a backbone or core device that have similar microflow information (e.g., DiffServ labels) may be grouped into a macroflow and can be metered, policed, and engineered efficiently. The concept of hierarchy in flow management and QoS classification has led to the use of Multiprotocol Label Switching (MPLS) as a means to manage and engineer macroflows.
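Macroflow aggregation can be sketched as collapsing many per-microflow entries into a few entries keyed by a shared attribute. Here that attribute is the DSCP, so the core meters a handful of aggregates instead of every socket. The microflow table, byte counts, and budget are invented for illustration.

```python
from collections import defaultdict

# Hypothetical microflow table at a core ingress:
# 5-tuple -> (dscp, bytes seen in the last measurement interval)
MICROFLOWS = {
    ("10.0.0.1", "192.0.2.5", 17, 4000, 5004): (46, 120_000),
    ("10.0.0.2", "192.0.2.6", 17, 4002, 5006): (46, 80_000),
    ("10.0.0.3", "192.0.2.7", 6, 3333, 80): (0, 900_000),
    ("10.0.0.4", "192.0.2.8", 6, 3334, 443): (10, 300_000),
}


def aggregate(microflows):
    """Collapse microflows into macroflows keyed by DSCP, summing bytes."""
    macro = defaultdict(int)
    for _five_tuple, (dscp, nbytes) in microflows.items():
        macro[dscp] += nbytes
    return dict(macro)


def meter(macroflows, budget_bytes):
    """Flag each macroflow whose byte count exceeds its per-DSCP budget (if any)."""
    return {dscp: nbytes > budget_bytes.get(dscp, float("inf"))
            for dscp, nbytes in macroflows.items()}
```

With the sample data, four microflows collapse into three macroflows. Given a 150,000-byte budget for DSCP 46, only that aggregate is flagged.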
MPLS was initially conceived as a mechanism to unify the IP and ATM domains at the Internet core. It has, however, also become a powerful traffic-engineering tool. At the simplest level, MPLS allows core traffic to be engineered either at a circuit level via an ATM switch or at a packet level. The label itself may denote an ATM virtual channel that has prescribed traffic behavior, or it may be used as a way to abstract Layer 3 microflow information and engineer macroflows at the core.
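At the packet level, MPLS forwarding is a label swap. Each label switching router (LSR) looks up the incoming label, replaces it, and forwards the packet on the associated interface. The sketch below assumes the 32-bit label stack entry layout from RFC 3032: a 20-bit label, a 3-bit EXP field (since renamed Traffic Class), a 1-bit bottom-of-stack flag, and an 8-bit TTL. The forwarding-table contents and interface names are hypothetical.

```python
def encode_lse(label: int, tc: int, bottom: bool, ttl: int) -> int:
    """Pack a 32-bit MPLS label stack entry: label(20) | TC/EXP(3) | S(1) | TTL(8)."""
    if not (0 <= label < 1 << 20 and 0 <= tc < 8 and 0 <= ttl < 256):
        raise ValueError("field out of range")
    return (label << 12) | (tc << 9) | (int(bottom) << 8) | ttl


def decode_lse(word: int):
    """Unpack a label stack entry into (label, tc, bottom_of_stack, ttl)."""
    return word >> 12, (word >> 9) & 0x7, bool((word >> 8) & 1), word & 0xFF


# Hypothetical label forwarding table at one LSR: in-label -> (out-label, out-interface)
LFIB = {100: (200, "pos1"), 101: (300, "pos2")}


def swap(word: int, lfib=LFIB):
    """Swap the top label and decrement TTL; return (new_word, out_interface) or None."""
    label, tc, bottom, ttl = decode_lse(word)
    if ttl <= 1 or label not in lfib:
        return None  # expired or unknown label: drop (error signaling omitted)
    out_label, out_if = lfib[label]
    return encode_lse(out_label, tc, bottom, ttl - 1), out_if
```

The TC/EXP bits are carried through the swap unchanged. This sketch preserves them as-is, so a DiffServ marking recorded there at the edge remains available to every core hop.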
Path discovery involves the use of routing protocols. Routing protocols such as RIP, OSPF, or BGP are inter-router information-exchange mechanisms. They build and maintain the packet-forwarding tables that the packet-forwarding blocks use to physically route traffic, and that policy and flow software use to maintain and update flow tables. These protocols include algorithms that use value metrics based on a variety of parameters. An example is the distance metric used by distance-vector protocols such as RIP, which favors the neighbor offering the shortest path (for RIP, the fewest hops) to the final destination. Other metrics used to build tables include latency and reliability.
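The distance-vector idea can be sketched as one Bellman-Ford relaxation round. For each destination, the router chooses the neighbor that minimizes the direct link cost plus that neighbor's advertised distance. This is a simplified illustration: it has no split horizon, no timers, and no 16-hop infinity as in RIP. The topology in the usage note is hypothetical.

```python
INF = float("inf")


def dv_update(my_links, neighbor_vectors):
    """One distance-vector round (Bellman-Ford relaxation).

    my_links:          {neighbor: cost of the direct link}
    neighbor_vectors:  {neighbor: {destination: advertised distance}}
    returns            {destination: (distance, next_hop)}
    """
    table = {}
    for nbr, link_cost in my_links.items():
        for dest, dist in neighbor_vectors.get(nbr, {}).items():
            candidate = link_cost + dist
            # Keep the neighbor that yields the smallest total distance.
            if candidate < table.get(dest, (INF, None))[0]:
                table[dest] = (candidate, nbr)
    return table
```

Take unit link costs to neighbors B and C, in the spirit of RIP's hop count. If B advertises destination X at distance 3 and C advertises it at distance 1, the resulting table routes X via C at distance 2.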
In an IP network, the network-layer functions drive the QoS assigned to various types of traffic. QoS is applied via traffic engineering, which involves three distinct mechanisms:
- Admission control. This mechanism acts on incoming traffic that has been categorized by the network layer to ensure that all flows of information meet predetermined profiles (arrival rates), which in turn are determined by service-level agreements.
- Traffic shaping and bandwidth management. In this case, flows and other related parameters are used to determine when and at what rates various types of packets egress the system. Queuing becomes an essential part of the shaping and bandwidth management.
- Congestion control. All network devices are expected to experience congestion. While QoS is generally thought of in terms of prioritizing outgoing traffic, the avoidance of congestion is a key mechanism that is often sidelined or forgotten. Large, time-varying traffic patterns coupled with service overlays on the infrastructure could potentially cause network outages. Controlling congestion involves statistical coloring of traffic based on network-layer and application-layer information. Usually, processes such as random early detection (RED) monitor the state of various queues within the system. They begin dropping packets with a probability that rises as average queue occupancy grows toward capacity. It is important to note that drop processes such as RED can be modulated by user-supplied weights.
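Admission control against an arrival-rate profile is commonly implemented with a token bucket, which bounds both the long-term rate and the burst size. The sketch below is a single-rate meter with hypothetical parameters. Expressing an SLA this way is an assumption here. A shaper would use the same bucket but delay nonconforming packets instead of flagging them.

```python
class TokenBucket:
    """Single-rate token-bucket meter: in-profile packets conform, excess does not."""

    def __init__(self, rate_bps: float, burst_bytes: float):
        self.rate = rate_bps / 8.0   # token fill rate in bytes per second
        self.capacity = burst_bytes  # bucket depth, i.e., the largest allowed burst
        self.tokens = burst_bytes    # start with a full bucket
        self.last = 0.0              # time of the previous update, in seconds

    def conforms(self, now: float, packet_bytes: int) -> bool:
        # Refill for the elapsed time, never beyond the bucket depth.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= packet_bytes:
            self.tokens -= packet_bytes  # in profile: consume tokens
            return True
        return False                     # out of profile: mark, drop, or delay
```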
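The RED behavior in the congestion-control item can be sketched as follows. The queue keeps an exponentially weighted moving average of its length. The drop probability rises linearly from zero at a minimum threshold to max_p at a maximum threshold, and every arrival is dropped beyond that. This simplified version omits RED's count-based spacing of drops and its idle-period handling. The thresholds and averaging weight are user-supplied parameters. Giving each traffic class its own set, as weighted RED variants do, is one plausible reading of the "user-supplied weights" in the text.

```python
import random


class RED:
    """Simplified random early detection drop decision for one queue."""

    def __init__(self, min_th, max_th, max_p, weight=0.002, rng=None):
        if not 0 <= min_th < max_th:
            raise ValueError("need 0 <= min_th < max_th")
        self.min_th, self.max_th, self.max_p = min_th, max_th, max_p
        self.w = weight                  # EWMA weight for the average queue length
        self.avg = 0.0
        self.rng = rng or random.Random()

    def drop_probability(self) -> float:
        if self.avg < self.min_th:
            return 0.0
        if self.avg >= self.max_th:
            return 1.0
        return self.max_p * (self.avg - self.min_th) / (self.max_th - self.min_th)

    def on_arrival(self, queue_len: int) -> bool:
        """Fold the instantaneous queue length into the average; True means drop."""
        self.avg = (1 - self.w) * self.avg + self.w * queue_len
        return self.rng.random() < self.drop_probability()
```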
It is extremely important to note that the time constant of the path-discovery process is on the order of 10 to 100 msec, while that of the packet-forwarding process scales with line rate (129 nsec per packet at OC-48c). The large difference between these time constants presents a logical opportunity for first-order partitioning: separating the packet classification and forwarding paths from the routing control path. Subsequent architectural decisions involve further partitioning of the packet classification/forwarding paths.
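The two time constants can be compared directly by dividing the routing time constant by the OC-48c per-packet interval, using the figures in the text:

```latex
\frac{10\ \text{ms}}{129\ \text{ns}} \approx 7.8 \times 10^{4}
\qquad
\frac{100\ \text{ms}}{129\ \text{ns}} \approx 7.8 \times 10^{5}
```

In other words, tens to hundreds of thousands of minimum-sized packets must be forwarded within a single routing time constant. That gap is why separating the two paths is a natural first partition.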
Two broad methods are generally followed: centralized packet forwarding and distributed packet forwarding. Key factors that drive the choice of approach include scalability, protocol support, and power.
The basic concept of distributed packet forwarding is illustrated in Figure 5. The network interface cards (NICs) essentially assume the role of a full router. They comprise all the hardware, including the physical medium dependent (PMD) and link layers, and are also fully equipped with packet-processing functionality as well as local QoS and traffic engineering. The switch fabric is optimized purely for nonblocking transport of packets across line cards and does not include any sophisticated QoS/traffic-engineering functions.
A summary of the differences between centralized and distributed packet forwarding, and of the implications of these differences, appears in Table 1.
Let's look at the relevant requirements and explore a "first-cut" architectural partitioning for a backbone routing device capable of line-rate performance at 40 Gbits/sec. The system-level functional requirements are specified in Table 2, so the discussion in this section focuses on the architectural and component-level implications of those requirements.
As stated previously, scalability, performance, and power requirements drive system partitioning. A distributed packet-forwarding architecture clearly lends itself to a more scalable system. It allows a maximum-capacity chassis and backplane to be built out and decouples the scaling of the network layer from that of the switch fabric. It's possible to take a "divide and conquer" approach to solving system performance and scaling issues by:
- Reducing the complexity of the switch fabric and making it an ultra-fast, highly integrated, dedicated data-transport layer.
- Building a chassis with an optical backplane (fiber) that can scale as high as 20 Gbits/sec per link.
- Decoupling the network layer from the switch fabric, allowing maximum flexibility in scaling each independently.
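Applying the same per-packet arithmetic used for OC-48c to the 40-Gbits/sec target shows how tight the budget becomes. This assumes 40-byte packets with no interpacket gap or overhead:

```latex
t_{\text{pkt}} = \frac{40 \times 8\ \text{bits}}{40 \times 10^{9}\ \text{bits/s}} = 8\ \text{ns},
\qquad
R_{\text{pkt}} = \frac{1}{t_{\text{pkt}}} = 1.25 \times 10^{8}\ \text{packets/s}
```

That is roughly 16 times less time per packet than the 129-nsec OC-48c budget.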