Mastering CUBIC Congestion Control: Debugging a Stuck Congestion Window in QUIC

Overview

This tutorial dives into a subtle but critical bug that surfaced when the CUBIC congestion control algorithm was ported from the Linux kernel to a QUIC implementation (quiche). The bug caused the congestion window (cwnd) to become permanently stuck at its minimum value after a congestion collapse, preventing recovery. You'll learn the underlying mechanics, step-by-step reproduction, a simple fix, and common pitfalls to avoid.

Source: blog.cloudflare.com

Prerequisites

Step-by-Step Instructions

1. Understanding CUBIC's Core Logic

CUBIC, standardized in RFC 9438, is the default congestion controller in Linux. It manages the cwnd to probe for available bandwidth: increasing cwnd when no loss is detected (probing), and decreasing it when loss occurs (backoff). The algorithm uses a cubic function (hence the name) to grow cwnd after a loss event, aiming for better network utilization.
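The cubic growth curve can be sketched in a few lines. This is a minimal illustration of the window function from RFC 9438, W_cubic(t) = C * (t - K)^3 + W_max, using the RFC's constants C = 0.4 and beta_cubic = 0.7. The function names and the 100-packet example window are assumptions made for illustration, not code from quiche or the Linux kernel.

```rust
/// Constants from RFC 9438.
const C: f64 = 0.4; // scaling constant of the cubic curve
const BETA_CUBIC: f64 = 0.7; // multiplicative decrease factor

/// Time in seconds for the curve to climb back to `w_max` when the
/// epoch starts at `cwnd_epoch` (both in packets).
fn cubic_k(w_max: f64, cwnd_epoch: f64) -> f64 {
    ((w_max - cwnd_epoch) / C).cbrt()
}

/// W_cubic(t) = C * (t - K)^3 + W_max, in packets.
fn w_cubic(t: f64, w_max: f64, k: f64) -> f64 {
    C * (t - k).powi(3) + w_max
}

fn main() {
    let w_max = 100.0; // hypothetical window just before the loss
    let cwnd_epoch = w_max * BETA_CUBIC; // window right after backoff
    let k = cubic_k(w_max, cwnd_epoch);
    // Sample the curve at the start, halfway to K, at K, and beyond K.
    for t in [0.0, k / 2.0, k, 2.0 * k] {
        println!("t = {:.2}s  W_cubic = {:.1}", t, w_cubic(t, w_max, k));
    }
}
```

The curve starts at the post-backoff window, flattens as it approaches W_max at t = K, and then accelerates past it, which is the probing behavior described above.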

2. The Bug: A Stuck Congestion Window

The bug manifests when the connection experiences heavy loss early, driving cwnd to cwnd_min (typically 2 or 4 packets). Normally, after a loss event, CUBIC should eventually recover and grow cwnd. However, due to an interaction with the app-limited exclusion (RFC 9438 §4.2-12), the cwnd becomes permanently stuck at the minimum. The app-limited rule is designed to prevent premature growth when the application isn't sending enough data, but a logic error causes CUBIC to never exit the recovery state.

3. Reproducing the Bug

To reproduce, set up a QUIC connection using quiche with CUBIC as the congestion controller. Simulate heavy packet loss (e.g., 50% loss rate) during the first few round trips. Monitor cwnd over time. With the bug present, cwnd drops to cwnd_min and stays there indefinitely, no matter how many ACKs arrive afterwards.

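// Pseudo-code sketch in the style of a quiche unit test.
// Cubic, on_loss, on_ack, initial_packet, and cwnd_min are illustrative
// names, not quiche's actual API.
let mut cc = Cubic::default();
cc.on_loss(initial_packet);  // heavy early loss drives cwnd to the floor
assert_eq!(cc.cwnd, cwnd_min);
// Deliver many ACK rounds; a healthy controller would grow cwnd here.
for _ in 0..1000 {
    cc.on_ack(now());
}
// On the buggy build this assertion still holds: cwnd never increased.
assert_eq!(cc.cwnd, cwnd_min);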
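
To get a feel for how quickly repeated losses push cwnd to the floor, the following sketch applies the RFC 9438 multiplicative decrease (beta_cubic = 0.7) until the window reaches cwnd_min. The 10-packet starting window, the 2-packet floor, and the helper names are assumptions chosen for illustration, and each loss is treated as a separate congestion event.

```rust
const BETA_CUBIC: f64 = 0.7; // RFC 9438 multiplicative decrease factor

/// Apply one loss-triggered backoff, clamped to the floor (packets).
fn backoff(cwnd: f64, cwnd_min: f64) -> f64 {
    (cwnd * BETA_CUBIC).max(cwnd_min)
}

/// Count the consecutive congestion events needed to reach the floor.
/// Assumes `cwnd_min` is positive, otherwise the loop would not end.
fn losses_to_floor(mut cwnd: f64, cwnd_min: f64) -> u32 {
    let mut events = 0;
    while cwnd > cwnd_min {
        cwnd = backoff(cwnd, cwnd_min);
        events += 1;
    }
    events
}

fn main() {
    // Hypothetical values: 10-packet initial window, 2-packet floor.
    // Sequence: 10 -> 7 -> 4.9 -> 3.43 -> 2.401 -> 2 (clamped).
    println!("{}", losses_to_floor(10.0, 2.0)); // prints 5
}
```

Under these assumptions, five congestion events in the first few round trips are enough to pin the window at its minimum.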

4. Root Cause Analysis

The bug stems from the porting of a Linux kernel patch that aligned CUBIC with the app-limited exclusion. In the Linux TCP stack, the app-limited check is wrapped inside a larger condition that only applies when the connection is not in recovery (recovery being the phase that follows a loss event). In quiche's port, that guard was omitted, so the app-limited exclusion fired even during recovery and prevented CUBIC from ever leaving the minimum cwnd. The exact location is in the cubic_update() function where tcp_friendliness adjustments are made.

5. The Fix: A One-Line Change

The fix adds a condition to skip the app-limited check when the connection is still in the recovery phase. In the quiche source, this is a single line added to cubic.rs:

// Before (buggy):
if app_limited { return; }

// After (fixed):
if app_limited && !self.recovery { return; }

This ensures that during recovery (post-loss), CUBIC continues to grow cwnd even if the application is not fully utilizing the window. Once recovery ends, the original app-limited logic applies.
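
The effect of the guard can be seen in a toy model. The sketch below is not quiche's implementation: ToyCc, its fields, and the one-packet-per-ACK growth are hypothetical stand-ins, and it assumes the sender is reported as app-limited on every ACK while in recovery. It only contrasts the two conditions shown above.

```rust
/// Toy controller, NOT quiche's code. All names are hypothetical.
struct ToyCc {
    cwnd: u32,            // congestion window in packets
    recovery: bool,       // true while recovering from a loss
    guard_recovery: bool, // true = fixed condition, false = buggy one
}

impl ToyCc {
    fn on_ack(&mut self, app_limited: bool) {
        // The only difference between the two builds is this condition.
        let skip = if self.guard_recovery {
            app_limited && !self.recovery
        } else {
            app_limited
        };
        if skip {
            return;
        }
        self.cwnd += 1; // stand-in for the real cubic growth step
    }
}

/// Start at a 2-packet floor in recovery and deliver 1000 ACKs,
/// each flagged as app-limited.
fn run(guard_recovery: bool) -> u32 {
    let mut cc = ToyCc { cwnd: 2, recovery: true, guard_recovery };
    for _ in 0..1000 {
        cc.on_ack(true);
    }
    cc.cwnd
}

fn main() {
    println!("buggy: {}", run(false)); // prints "buggy: 2"
    println!("fixed: {}", run(true)); // prints "fixed: 1002"
}
```

Without the guard the early return swallows every ACK, so the window never moves. With it, the same ACK stream grows the window.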

6. Verifying the Fix

Re-run the reproduction test. The cwnd should now start increasing again and eventually leave the minimum. Use a debug trace of cwnd across ACK rounds to confirm that the window rises above cwnd_min and keeps growing.
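
One way to check a recorded trace is sketched below. It assumes the trace is a list of cwnd samples in bytes, one per ACK round, starting at the floor. The helper name, the sample values, and the 2400-byte floor are made up for illustration and do not come from quiche's logging.

```rust
/// Return the index of the first sample where cwnd rises above
/// `cwnd_min`, or None if the window never leaves the floor.
fn first_growth(trace: &[usize], cwnd_min: usize) -> Option<usize> {
    trace.iter().position(|&cwnd| cwnd > cwnd_min)
}

fn main() {
    // Hypothetical traces in bytes, one sample per ACK round.
    let stuck = [2400, 2400, 2400, 2400];
    let recovering = [2400, 2400, 3600, 4800];
    println!("{:?}", first_growth(&stuck, 2400)); // None
    println!("{:?}", first_growth(&recovering, 2400)); // Some(2)
}
```

A None result on the fixed build would indicate that the window is still stuck at the minimum.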

Common Mistakes

Summary

This tutorial walked through a real-world bug where CUBIC's congestion window got stuck at minimum due to a misapplied app-limited exclusion in a QUIC implementation. By understanding the core logic, reproducing the issue, and applying a one-line fix, we prevented permanent throughput collapse. Key takeaways: always verify edge-case behavior, avoid blind code porting, and test recovery paths thoroughly.

For further details, refer to the original overview or explore the quiche source code.
