LINUX Congestion controls

A general overview of the network TCP congestion controls implemented in LINUX OS.

What is congestion control?

The vast majority of data on the Internet today is transmitted using the Transmission Control Protocol (TCP) or User Datagram Protocol (UDP) protocols. UDP is a simple stateless protocol that sends packets thorough the network without checking if they arrived to the destination. On the other hand, TCP is a more complex stateful protocol that establishes a connection between the source and destination of the packet and checks if it arrived correctly to the destination. TCP also supports additional functionality such as segmentation, reliability and congestion control.

TCP's congestion control is used to prevent the congestion of packet transfer on the given network link. It uses additive increase / multiplicative decrease (AIMD) scheme, with additional schemes including slow start and congestion window (CWND). The AIMD is a closed-loop control algorithm that combines linear growth of the congestion window width an exponential reduction when a congestion takes place. The CWND is maintained by the sender and enables sender to stop a link between it and the receiver from becoming overloaded with to much traffic. The size of the congestion window is determined by estimating how much congestion there is on the link. The flow of data over a TCP connection is controlled by the receive window advertised by the receiver and sender can only send less data then is own CWND and the receive window. Slow start begins with the initial CWND size and increases it by 1 for every successful acknowledgement received until the packet loss is detected or receive window size is reached.

Congestion control in Linux

There are a plethora of algorithms for TCP congestion control. However, Linux only implements a subset of said algorithms, which are listed below.

Binary Increase Congestion control (BIC)

BIC is used by default in Linux kernels 2.6.8 through 2.6.18 from August 2004 until November 2006. The protocol adapts its CWND depending on the binary search increase and additive increase bandwidth probing parameters. The BIC algorithm has the following parameters:

Smax:    the maximum increment
Smin:    the minimum increment
Wmax:    the maximum window size  
B:       multiplicative window decrease factor
cwnd:    congestion window size  
bic_inc: window increment per RTT (round trip time)

The algorithm updates the CWND window at every RTT interval. In the case, when no packets are dropped the CWND increases with one of the three available ways: binary search increase, additive increase or slow start:

if (cwnd < Wmax)                    // binary search OR additive
    bic_inc = (Wmax - cwnd) / 2;
else                                // slow start OR additive
    bic_inc = cwnd - Wmax;

if (bic_inc > Smax)                 // additive
    bic_inc = Smax;
else if (bic_inc < Smin)            // binary search OR slow start
    bic_inc = Smin;

cwnd = cwnd + (bic_inc / cwnd);

When one or more packets are dropped, the CWND is reduced using the multiplicative decrease. The decrease happens in two ways using fast or slow convergence depending on the current CWND size:

if (cwnd < Wmax)                    // fast convergence
    Wmax = cwnd * (2-B) / 2;
else                                // slow convergence
    Wmax = cwnd;

cwnd = cwnd * (1-B);

Cubic Binary Increase Congestion control (CUBIC)

CUBIC has been used by default in Linux kernels since version 2.6.19, which was released in November 2006. The protocol improves upon BIC, by modifying the CWND size by using the cubic function of time since last congestion event, which allows the network to stabilize before the more bandwidth is allowed. CUBIC determines the CWND size independently from real-time and is not RTT dependent and simpler than BIC. The CUBIC algorithm has the following parameters

B:      Multiplicative decrease factor (default value = 0.7)
Wmax:   Window size just before the last reduction
T:      Time elapsed since the last window reduction
C:      A scaling constant (default value = 0.4)
cwnd:   The congestion window at the current time

The size of CWND is then modeled as follows:

$cwnd = C(T-K)^3+w_{max}$ ; where $K=\sqrt{\frac{w_{max}(1-\beta)}{C}}$

Proportional Rate Reduction (PRR)

PRR has been in the Linux kernel -- to improve loss recovery -- since version 3.2, which became available in January 2012. The protocol determines the number of new packets to be send on the basis of number of acknowledgements received to provide a better recovery, close to the slow start as possible after a packet loss event is detected. PRR algorithm has the following parameters:

prr_delivered:  Total bytes delivered to the receiver since the start of the recovery
prr_out:        Total bytes transmitted during recovery.
RecoverFS:      The FlightSize at the start of recovery.
DeliveredData:  The number of bytes that the current acknowledgement indicates that have been delivered to the receiver.
sndcnt:         Determines how many bytes to send in response to each ACK.

The initialization on entering recover:

sshthresh = CongCtrlAlg()       // target CWND after recovery
prr_delivered = 0               // total bytes delivered during recovery
prr_out = 0                     // total bytes sent during recovery
RecoverFS = snd.nxt - snd.una   // FlightSize at the start of recovery

On every acknowledgement during recovery compute:

DeliveredData = delta(snd.una) + delta(SACKd)                       // DeliveredData is number of new bytes that the current acknowledgment indicates have been delivered to the receiver.
prr_delivered += DeliveredData
pipe = PipeAlgorithm()                                              // pipe algorithm according RFC 3517

if (pipe > ssthresh)
    sndcnt = CEIL(prr_delivered * ssthresh/RecoverFS) - prr_out     //proportional rate reduction
else
    ss_limit = MAX(prr_delivered - prr_out, DeliveredData) + 1      // slow start
    sndcnt = MIN(ssthresh - pipe, ss_limit)

sndcnt = MAX(sndcnt, 0)
cwnd = pipe + sndcnt

On any data transmission or retransmission:

prr_out += data_sent

At the end of recovery:

cwnd = ssthresh

Bottleneck Bandwidth and Round-trip propagation time (BBR)

BBR is incorporated in Linux kernels to enable model-based congestion control. This has been the case since version 4.9, which became available in December 2016. The protocol builds the model of the network over time from the maximum bandwidth and round trip time at which the network delivered the last number of packets. It exploit the latency associated with the bufferbloat instead of packet loss to better model the maximum throughput and provide higher throughput and lower latency. The BBR algorithm has two main parts. First, each new acknowledgement provides new RTT and average delivery rate measurements that update the algorithms estimates:

rtt = now - packet.sendtime
update_min_filter(RTpropFilter, rtt)
delivered += packet.size
delivered_time = now
deliveryRate = (delivered - packet.delivered) / (delivered_time - packet.delivered_time)

if (deliveryRate > BtlBwFilter.currentMax || ! packet.app_limited)
    update_max_filter(BtlBwFilter, deliveryRate)

if (app_limited_until > 0)
    app_limited_until = app_limited_until - packet.size

Second, it paces the sending rate of each data packet to match link's bottleneck:

bdp = BtlBwFilter.currentMax × RTpropFilter.currentMin

if (inflight >= cwnd_gain × bdp)
    // wait for ack or retransmission timeout
    return

if (now >= nextSendTime)
    packet = nextPacketToSend()
    if (! packet)
        app_limited_until = inflight
        return
    packet.app_limited = (app_limited_until > 0)
    packet.sendtime = now
    packet.delivered = delivered
    packet.delivered_time = delivered_time
    ship(packet)
    nextSendTime = now + packet.size / (pacing_gain × BtlBwFilter.currentMax)

timerCallbackAt(send, nextSendTime)

Glossary

Layer 4 (Transport Layer)

The layer of the OSI model that handles traffic between hosts and clients (TCP/UDP).

TCP

Transmission Control Protocol.