A Fast Switched Backplane for a Gigabit Switched Router
September 22, 2008
This paper describes the motivation for replacing congested, shared backplanes with switched backplanes. There are two general types of router functions: datapath functions operate on every packet passing through the router, while control functions operate more infrequently. To focus on per-packet router performance, attention should be paid to the datapath functions. These include forwarding decisions, backplane queuing/dropping of packets, and the output link scheduler. Router evolution has pushed more and more functionality into the router hardware, while shying away from the use of shared buses (i.e. shared memory backplanes). The limitation resulting from using a shared central bus, with central CPU, memory and line cards is that the CPU has to process every packet which thus limits the system throughput. Parallelism helps by having multiple CPUs to process packets at the same time, especially if separate CPUs are placed between the bus and each line card interface, while a central CPU handles the forwarding tables. However, this architecture can be limited by software-based forwarding decisions. Throughput can also be increased by using a crossbar switch so that multiple line cards can communicate with one another simultaneously. The ideal architecture is thus characterized as using a cross-bar switch with hardware forwarding decision engines.
Switched backplanes are necessary because bus bandwidth is not increasing as fast as the need for routing bandwidth, and high-speed shared backplanes lead to congestion due to port contention as well as limited transfer capacity due to limits of electrical loading. Crossbar switches can enable high performance because connections from line cards to the central switch are high-speed point-to-point links, and can support multiple bus transactions simultaneously. Efficiency in the crossbar switch can be increased by using fixed-length over variable-length packets, as the former as simpler, faster, and allow for time progression in fixed-length increments.
Crossbar-switches are internally non-blocking, but their performance can still be limited by head-of line blocking (HOL), input blocking, and output blocking. The HOL problem is solved with virtual output queuing (VOQ), in which there is a separate FIFO input queue for each output. The other two problems involve controlling delay of packets inside the router. The authors propose two solutions: (1) prioritize packets so that the delay of a high-priority one doesn’t affected by that of a low-priority one, or (2) run the crossbar switch faster than the external line rate. The authors state that a crossbar scheduling algorithm should have high throughput, be starvation free, be fast, and be simple to implement. The iSLIP algorithm meets these requirements; it iteratively selects crossbar configurations to match inputs to outputs while using round-robin scheduling of active inputs. The algorithm is extended (EiSLIP) to support multicast in addition to unicast.
I liked this paper more than the previous one because its figures were easier to decipher. It seemed to have the opposite problem, though–much less detail. Perhaps I should have read this one first. Still, the paper elaborated on each of the figures and referred to them while describing motivation for making architectural changes. In section 2, the authors mention that it is a common misconception that routers are limited by their ability to forward packets, and I was wondering why this is so.
Scaling Internet Routers Using Optics
September 22, 2008
The authors describe their motivation as wanting to see how Internet router capacity can scale to keep up with Internet traffic growth, and how optical technology inside routers can help that scaling. Most high capacity routers these days are multi-rack in order to reduce the power density incurred when using a single rack. Multi-rack systems, however, can have unpredictable performance and poor scalability. The authors describe the Load-Balanced switch that has a single stage of buffers between identical two stages switching. There are N intermediate inputs, each with N FIFO queues for each of the N outputs. The first stage acts as a load-balancer to spread traffic over each output queues, while in the second stage each output queue is served at a fixed rate. The authors show that with uniform packet arrival rate, there is guaranteed 100% throughput at a fixed, equal-rate switch. The second load-balancing stage can spread out non-uniform traffic in order to meet the above condition for guaranteed throughput.
Without a centralized scheduler, there still needs to be a switch fabric that is reconfigured for every packet. Instead of doing that, each fixed,equal-rate switch can be replaced with a mesh of N^2 fixed channels with rate R/N and two switches can replaced with a single switch that runs twice as fast. Then every packet will go through the single stage twice, each at rate R/N. Having N^2 mesh links becomes impractical as N grows large, so the authors decide to use wavelength-division multiplexing to reduce the number fibers while increasing the data-rate on the remaining fibers. Another problem to be dealt with was the potential for packet mis-sequencing since packets from the same application could be routed to different intermediate linecards and thus to different outputs. To deal with this, the authors proposed a scheme called Full Ordered Frames First (FOFF) to bound the intermediate queue lengths and any necessary re-sequencing. If some linecards are missing, the switch fabric needs to deal with spreading the traffic amongst the remaining linecards. The authors describe two architectures to deal with this problem.
I had to look up a couple definitions while I was reading this paper. A single stage crossbar is where every input is connected to every output, like having a matrix of connections. In wavelength-division multiplexing, multiple optical carrier signals are multiplexed “on a single optical fiber by using different wavelengths of laser light to carry different signals” (thanks Wikipedia).
Even after looking up some definitions, I am still a little confused about the implications of using single-stage versus other types of switching. This paper was hard to read, and I found it difficult to fully understand and relate to the problem the paper was trying to solve. I think I disliked this paper because I had trouble following the story for motivating and deriving the solution. I also didn’t realize the authors’ proposal for the electro-optical router wasn’t completely tangible at that time, until they pointed out that they predict it could be built in three years (so 2006?). I’m wondering what router changes were made because of this work, or if real-life implementation never panned out.
Random Early Detection Gateways for Congestion Avoidance
September 16, 2008
TCP only detects congestion when packet dropped, but it would also be desirable to not have full queues all the time; this would increase average delay throughout the network. Want to keep throughput high while keeping average queue size low. Without gateway knowledge, TCP has difficulty in distinguishing propagation delay from persistent queuing delay. In Random Early Detection (RED), gateway detects congestion in its early stages by computing the average queue size for each output queue. The gateway chooses connections to notify about the congestion. When average queue size exceeds a threshold, gateway marks packets with probability proportional to its connection’s share of the throughput. The gateway signals congestion by dropping packets or by setting feedback bit in packet header (depending on transport protocol).
If RED chooses to mark packets by dropping them, then it is controlling queue size even without cooperation from transport protocol. RED avoids global synchronization where all connections reduce their window, avoiding the slow-start for everyone. RED works concurrently with the congestion control provided by the transport layer, but is designed for a network where a single packet loss is enough to indicate congestion. It supports gradual deployment in Internet since can coexist with other schemes. RED was influenced by Early Random Drop gateways and binary feedback congestion avoidance schemes that calculate average queue size. RED avoids global synchronization observed with Drop tail gateways and RED is better at computing the queue size over a longer period of time with a time-based exponential decay. RED also separates the algorithms for detecting congestion and choosing connections to send feedback bit to; in order not to bias against bursty traffic, RED randomly picks packets to mark. The algorithm computes average queue size using low-pass filter with an exponential weighted moving average: short-term increases in queue size do not significantly raise average queue size. When the size is below a minimum threshold, no packets are marked. When size is above maximum threshold, all packets marked get marked. When in between thresholds, packet is marked with probability as described above.It’s important to keep average queue size low while still allowing fluctuations in queue size to accommodate bursty traffic.
I was confused as to why do the authors suggest that fair queuing and other control schemes only be used when non-per-connection mechanisms are inadequate. Is it significant overhead to deal with individual connections? How is inadequacy defined? On an unrelated note, I didn’t get a good feel about the associated tradeoffs of the min and max threshold parameters.
Congestion Control for High-Bandwidth-Delay Product Networks
September 16, 2008
TCP becomes oscillatory and prone to instability as product of bandwidth and latency increases. The paper describes protocol (XCP) that is fair and efficient as link bandwidth or Round Trip Time increases. TCP may waste a lot of RTTs getting back up to full utilization after congestion; this is especially unfortunate for short TCP flows that can’t acquire bandwidth during slow start even when it’s available. Really want to decouple control of utilization and fairness for service differentiation. The authors argue that packet loss is not a good indicator of congestion since it takes time to find out about loss and it isn’t the only symptom of congestion–congestion signaling should be more than binary; the network should tell the sender how much congestion there is and how to deal with it.
There needs to be a feedback loop between controller and aggregate traffic on link, which motivates separation of efficiency control and fairness control; have efficiency controller (EC) and fairness controller (FC). A packet header has sender’s cwnd and RTT estimate, as well as feedback initialized to its demands–this field is modified by routers. EC looks at aggregate traffic to maximized link utilization while minimizing drops and persistent queues. Its feedback is proportional to spare bandwidth, defined as difference in link capacity and utilization, as well as the persistent queue. EC doesn’t care by how much each individual flow adjusts its window, only focusing on total traffic. The FC still uses the additive-increase, multiplicative-decrease principle, so its positive feedback has each flow throughput increase by same while negative feedback of a flow is proportional to its current throughput. The multiplicative part makes fairness converge, and the process is decoupled from actual packet loss. When efficiency is already near optimal, use bandwidth shuffling to simultaneously allocate and deallocate bandwidth such that overall traffic rate doesn’t change but individual flow throughputs change to approach fairness.
Tuthors describe simulation showing XCP outperforms TCP in conventional and high-bandwidth-delay environments. The simulation shows that, regardless of queuing scheme:
- TCP performance is degraded by an increase in link capacity and propagation delay while XCP always has near optimal utilization and no dropped packets
- with short, web-like flows, XCP maintains high utilization and short queues
- XCP is fairer than TCP
Core-Stateless Fair Queuing
September 11, 2008
End-to-end congestion control can be improved when routers have the ability to allocate bandwidth fairly. However, fair allocation mechanisms require the routers to maintain state and take action on a per-flow basis. Dealing with individual packets makes these mechanisms complex and can thus hinder their practical adoption. In the authors’ proposed architecture, they define an “island of routers” that has an edge and core. The edge routers compute per-flow rate estimates and insert these estimates into each packet header. The core routers use FIFO queuing with no per-flow state, using the rate estimates in the packet headers and their own traffic measurements. The authors don’t claim that their technique can achieve fairness levels like that of Fair Queueing, but emphasize that they simply wish to have an approximation to fair bandwidth allocation. The paper first describes the fluid model algorithm, in which a router is bufferless and has a “fair share rate” describing the maximum flow at any given time. A probabilistic forwarding algorithm shows the probability of dropping each incoming bit of a flow, and thus the arrival rate of a flow is the minimum of it and the fair share rate.
In a real system, arrival rates are unknown and need to be estimated. The flow rate is estimated at each edge router via exponential averaging an iterative algorithm that updates the estimate with arrival times and lengths of incoming packets. To estimate the fair share rate, need to aggregate measurements of the summation of the min(flow rate, fair rate)–the accepted traffic–and of the aggregate arrival rate. Both estimates are computed via exponential averaging iterative algorithms updates. The update rule for the estimated fair share rate depends on whether the network is congested, in order to match accepted rate to the link bandwidth. These algorithms can also be supplemented with weighted flows to implement differential services. The authors provide a somewhat reasonable explanation as to why a flow can exploit the fuzziness in using estimates to gain more than its fair share. Their example shows that they can bound the amount of “excess service” a flow gets, thus eliminating any long-term benefits.
A simulation compares CSFQ to algorithms with fairness attempted (FIFO, RED) and as well as some with that try fairness (FRED, DRR), with the goal of seeing where CSFQ lies between those two extremes. In the single congested link experiments, DRR has the best fairness achievement, while does a little worse (but better than FIFO, RED, FRED). When the UDP flows are ill-behaved and TCP flows introduced, only DRR and CSFQ are able to contain the UDP flow. Also notable was that CSFQ performs better than DRR with a large number of flows. In the multiple congested link experiments, CSFQ falls only slightly behind DRR. CSFQ also performs well in the adaption schemes, ON-OFF, large latency, and packet relabeling experiments, but poorly in the web traffic experiment.
I think this paper answers my question the FQ paper regarding its practicality; while FQ does achieve optimal fairness and has inspired research, it is too complex to implement in practice. There was a statement in this paper that confused me, however. The authors state their assumptions that (1) fairness mechanism is important for congestion control, and (2) FQ are too complex to be practically useful. I suppose the latter assumption could be debated, but is the former assumption implying that fairness considerations are considered unnecessary for congestion control? When would it hurt?
Analysis and Simulation of a Fair Queueing Algorithm
September 11, 2008
There are two points of implementing congestion control methods. At the source, flow control algorithms can vary the rate at which packets are sent to limit the overall network project. The second point is at the gateway through routing and queuing algorithms. Queuing algorithms affect network traffic by determining how packets from different sources interact to influence the behavior of flow control algorithms. These algorithms allocate bandwidth, promptness, and buffer spaces at the gateway. The FCFS intertwines these three independent allocation issues, and does not function well if there are ill-behaved sources. The authors remark that the Fair Queuing model proposed by Nagle does not have performance data, and thus the paper demonstrates via simulation FQ versus FCFS.
The motivation for fair sharing is that an ill-behaved source cannot get more than its fair share of bandwidth and buffer space, which includes consideration of packet length (Nagle did not consider this). Round-robin service provides fair allocation of sent packets, but is not a fair allocation of bandwidth due to differing packet lengths. While sending packets bit-by-bit in a round-robin fashion would solve that problem, it is definitely unrealistic in practice. In the authors discussion of the their transmission algorithm, they denote R(t) as the number of rounds at time t; N_ac(t) as the number of active conversations; S_i and F_i as the respective start and finish time R(t) of a packet. These functions and quantities depend only on packet arrival times t and not packet transmission times. In the authors’ packet-by-packet transmission algorithm, when a packet finishes transmission, the next packet to be sent is the one with the smallest finish time (F_i). Their version of the algorithm is non-preemptive, such that the current packet in transmission cannot be preempted by a newly arriving packet with a smaller finishing time.
To separate promptness allocation from bandwidth, the authors introduce a parameter to express “bids” that affectively gives faster service to packets arriving at an inactive conversation. To characterize the promptness allocation for an arbitrary stream of packets, the authors consider two sources of packets: (1) FTP-like that is always ready with packets, so the transfer time is of interest, and (2) Telnet-like that intermittently producing packets, so packet delay is of interest. When a poisson Telnet source is sharing a gateway with three FTP sources, FQ is able to give a lower delay for lower throughput independent of the window sizes of the FTPs. This is a much better performance than FCFS. The authors further compare via simulation FQ versus FCFS using a variety of flow control algorithms. The takeaway from these simulations is that a fair queueing gateway algorithm needs to be combined with a good flow control algorithm to provide adequate congestion control.
This was a good paper to elaborate on the role and responsibility of the gateway when it comes to congestion and fairness. It demonstrates the motivation and advantages for distinguishing users and considering bandwidth, promptness, and buffer space allocation separately. The questions at the very end of the paper, however, leave me intrigued: what was/is the future of fair queueing. I also would have liked to see simulation results on a realistic workload in addition to their illustrative workloads.
Congestion avoidance, rather than congestion control, allows networks to operate in an optimal situation: low delay and high throughput. Their analysis is of decentralized decision making algorithms with binary feedback. The authors evaluate different increase/decrease algorithms for the following performance metrics:
- efficiency: closeness of total load to the knee
- fairness: users in the same bottleneck class should have equal share of that bottleneck
- distributed-ness: system gives minimal feedback
- convergence time: time taken to reach oscillation equilibrium, including size of oscillation
The paper starts by expressing system state transitions in n-dimensional space with n users; the optimal point is where the lines (hyperplanes?) for fairness and efficiency intersect. All points on the line to the origin have the same fairness (“equi-fairness”), and thus is not affected by multiplicative changes. The goal is to converge to that optimal state, however not all control policies are able to do so. In their analysis, the authors are looking to constrain the constants involved in the increase/decrease algorithms, viz. a_increase, a_decrease, b_increase, and d_decrease.
For efficiency convergence, the algorithm wants to ensure negative feedback: when asked to decrease, ensure the total doesn’t increase and when asked to increase, ensure the total load doesn’t decrease. For fairness convergence, must have one of two situations: (1) fairness goes up during decrease, and goes up or stays the same during increase, or (2) fairness goes up during increase, and goes up or stays the same during decrease. Fairness constrains the constants to be positive, and further that cannot have a_increase = b_increase = 0, or a_increase = a_decrease = 0. The distributed-ness requirement makes the condition for efficiency convergence stronger since the algorithm can’t include knowledge about other users. To satisfy distributed convergence while maintaining that for fairness and efficiency, authors prove that linear decrease should be multiplicative and linear increase should have additive component with possible multiplicative component no less than one. However, if each user truncates control before violating efficiency constraints, increase and decrease policies can each have both additive and multiplicative components. The authors then show optimal convergence for efficiency and fairness, which in turn show that for feasbility and optimal convergence to fairness, the increase part of the algorithm should be additive and the decrease should be multiplicative. Nonlinear controls are not suitable for practical purposes because they complicate finding the correct scaling factors, making these controls less robust.
Some background for this paper includes the origin of network congestion: mismatch in node arrival and service rates due to the heterogeneity of new technology like LANs and old technology like twisted pair. When network load increases to the point of a sudden drop in throughput (the “cliff”), packets start getting lost and the network is said to be congested. The point when the amount of load causes throughput increase to be small but response time starts to significantly increase is called the “knee.” The authors also note using a binary feedback scheme, which is a congestion avoidance algorithm in which resources monitor their usage and determine if they’re at an optimal level. Resources send binary feedback to the system in the packet header: a “1″ if overloaded and thus need to decrease load, or “0″ if underloaded and need to increase the load.
I like how this paper portrays their analysis through the vector representation of fairness and efficiency, and the iterative process of changing demand in the attempt to converge the optimal point in that space. The authors’ “narrowing down” of possible values of the algorithm’s constants was an effective way to demonstrate the optimal policy for their performance criteria.
Some questions that this paper left open were:
- how does delayed feedback effect control?
- would more feedback bits help?
- would guessing the total number of users help?
- how would asynchronoy affect the algorithm?
Congestion Avoidance and Control
September 8, 2008
This paper argues that the implementation of the transport protocols is the cause of most congestion problems, like dropped packets due to local buffer overflows. Obvious ways of implementing a window-based protocol can result in “wrong” behavior; the authors discuss and describe simple algorithmic solutions. After the congestion collapse and subsequent reduced bandwidth between LBL and UC Berkeley, the authors added some new algorithms to TCP, including: round-trip-time variance estimation, exponential retransmit timer backoff, slow-start, aggressive receiver ack policy, dynamic window sizing on congestion, clamped retransmit backoff, and fast retransmit. Many of these algorithms were motivated by the observation that data flow in TCP should obey the “conservation of packets” principle: at equilibrium, a new packet isn’t put into network until and old one leaves. This property implies that the protocol is self-clocking, since it uses acks to know when to send new packets.
Packet conversation can fail if (1) equilibrium isn’t reached, (2) the sender injects packets before an old packets exits, or (3) the system can’t reach equilibrium due to resource limitations. The self-clocking nature of the conservation principle can make it difficult to stat the system, so the authors developed a “slow-start” algorithm, using a congestion window, to gradually increase in-transit data. Regarding item (2): if the protocol implementation is correct, a sender prematurely introducing a packet represents a failure in the sender’s timer. This implies a need for a good round-trip estimator, motivating the authors to develop a method to estimate trip variation. Timing problems can also be caused by spacing when retransmitting packets; the best method is an exponential backoff.
When timers are okay, it’s probably that packet loss is due to network congestion rather than packet damage (which is rare). To avoid congestion, one needs to ensure that the network can signal endpoints that congestion is occurring and that endpoints have a policy to decrease utilization once this signal is received. The act of dropping a packet is enough announcement that the network is congested. The authors show that decreased utilization at endpoints is best done with a multiplicative decrease window size (an exponential decrease over time). A related issue is the task of increasing utilization for endpoints to discover the current limit, which is better done with an additive increase to avoid the costly mistake of overestimating. The authors note that algorithms at endpoints can’t ensure fair sharing; future work includes handling fair sharing responsibility at the gateway.
This paper is great for demonstrating the techniques and motivations for some of the additions to TCP. It’s always amazing how a few tiny (but important!) lines of code that produce dramatic changes. One assumption they make, however, is that a sender introducing a packet too early is a sign of a faulty timer. Adversarial senders could take advantage of the implications of self-clocking to receive packets at a higher rate. There were a bunch of tricks like this in a paper whose title I can’t remember at the moment.