There is a particular kind of lie your network tells you.

You run a speedtest. The numbers look perfect. You breathe out. The server is “fine”.

Then you do something boring: pull a container image, push a release artifact, rotate a backup, let your NAS sync, or start a CI job that is basically a conveyor belt of uploads. Suddenly the machine feels haunted. SSH keystrokes arrive late. API p99 looks like a seismograph. Voice calls turn metallic. A game server starts rubber-banding.

Not because the link is slow, but because it is late.

That gap between bandwidth and responsiveness is what most “network tuning” ignores. It is also where the wins actually live: queueing and congestion control.

If you have lived behind CGNAT, mixed ISPs, tunnels, Wi-Fi dead zones, weird duplex layouts, “why is this VPS routing through another continent,” or that one peering path that feels like a tourist bus, then you already know:

The internet is not a pipe. It is a pile of queues.

And queues, left alone, grow teeth.


The real enemy is not throughput, it is queueing delay

When the link gets busy, packets do not vanish. They wait.

Buffers exist everywhere: ISP gear, modem/ONT, consumer routers, hypervisor virtual NICs, kernel queue disciplines (qdiscs), sometimes inside hardware built with the old belief that “bigger buffers equals fewer drops”.

That belief has a name: bufferbloat (excessive buffering that inflates latency under load).


Symptoms:

  • ping is fine when idle, awful during upload
  • a single big flow (backup/download) can starve “small” flows (SSH, API, DNS)
  • the network is fast while the experience is bad

In practice, “slow” is often not a bandwidth problem. It is queueing delay.

flowchart LR
  U[Bulk upload/download] --> Q[(Queue)]
  S[SSH / API / DNS / game ticks] --> Q
  Q --> B[Bottleneck link]
  B --> I[Internet]
  Q --- L[Latency under load rises as the queue grows]
  style Q stroke-width:2px

Congestion control: the part of TCP that decides “how hard to push”

TCP must decide how much data to keep in flight. Too little and you waste capacity. Too much and you build queues or trigger loss. That decision-making system is TCP congestion control.

For a long time, the mainstream approach treated packet loss as the primary signal of congestion: increase until loss, then back off. That can be fine in clean datacenters. It can be a mess on paths that are lossy, shaped, NAT’d, tunneled, or just “real”.


BBR: stop guessing by loss, start measuring bandwidth and RTT

BBR tries to estimate:

  • bottleneck bandwidth (how much the path can actually carry)
  • minimum RTT (best-case round trip time when queues are empty)

Then it paces sending to hover around an operating point that keeps throughput high without letting queues inflate into latency disasters.

BBR is not magic, and it has tradeoffs. It is also not one fixed algorithm: there are different versions (v1 vs v2), and behavior differs by kernel and distro. The internet has had fairness debates about BBR for years, especially when BBR flows share a bottleneck with classic loss-based flows. The practical approach is simple: treat it as a tool, not a religion. Enable it, measure, and keep it only if your workload gets better.
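A crude way to A/B that on Linux: iperf3 can select the congestion control per test with -C, so you can compare algorithms back to back on the same path. The server hostname below is a placeholder; you need a reachable iperf3 server.

iperf3 -c iperf.example.net -t 30 -C cubic   # baseline with the loss-based default
iperf3 -c iperf.example.net -t 30 -C bbr     # same path, same duration, with BBR

Watch not just the throughput line but retransmits, and run a ping alongside each test to see what it does to latency.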


The unsung hero: fair queueing (fq), “let the SSH breathe”

Even the best congestion control can be undermined if your local queue is a single FIFO line. If one flow floods the queue, everything behind it waits.

That is why fair queueing matters.

On Linux servers, one of the simplest, highest-impact moves is setting the default qdisc to fq:

  • it splits traffic into flows and schedules them fairly
  • it prevents one bulk flow from smothering interactive flows
  • it pairs naturally with pacing (which BBR benefits from)

This is the boring, usually correct pairing for servers:

BBR + fq
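The sysctl route (shown later) changes the default qdisc used when interfaces come up; on a live box you can also swap one interface's root qdisc directly. eth0 is a placeholder:

sudo tc qdisc replace dev eth0 root fq   # takes effect immediately
tc qdisc show dev eth0                   # confirm what is installed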


The edge is where latency goes to die (and where you can save it)

If your home uplink is the bottleneck, you can tune your server all day and still suffer, because the worst queue is often at the edge, where the uplink saturates.

Whoever owns the queue at the bottleneck owns your latency.

That is why the most effective move in homelabs is counterintuitive:

Make your router the bottleneck, slightly below the real link rate.

This is shaping: you keep ISP buffers from bloating by ensuring the queue that fills is the one you control. Pair that with active queue management (AQM) to keep the queue short under load.

Common AQM pieces:

  • fq_codel: CoDel plus per-flow queueing; the default qdisc on many modern distros
  • cake: shaping, AQM, and per-flow fairness in a single qdisc, designed for edge links

The goal is not “no drops”. The goal is “no giant waiting room”.
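As a sketch of what “own the bottleneck” looks like on a Linux router, assume a 40 Mbit uplink on eth0 (both are placeholders to adapt). CAKE does the shaping and the queue management in one step; it has been mainline since kernel 4.19:

sudo tc qdisc replace dev eth0 root cake bandwidth 38mbit   # ~95% of the real rate
tc -s qdisc show dev eth0                                   # watch delay/drop stats under load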


Myth vs reality

Myth 1: “My speedtest is great, so my network is great”

Reality: Speedtests are optimized for throughput, often to a nearby server, often using parallel streams, often over a path that is not the same path your services take.

You are measuring “how fast can I shove bytes to this cooperative endpoint right now,” not “how stable is my latency to the things I actually use.”

Speedtests can also be too flattering:

  • the test server might be inside your ISP
  • it might sit on a well-peered CDN edge
  • it might avoid the congested transit route your VPS uses

Myth 2: “Ping is the truth”

Reality: Ping is ICMP, and not all networks treat ICMP the same as real traffic.

Many devices:

  • deprioritize ICMP when busy
  • rate-limit it
  • handle it on a slow path (CPU) while forwarding TCP/UDP in hardware
  • block it in some hops (so missing replies do not always mean loss)

ICMP can lie about absolute numbers, but it is still great at exposing queue growth. If ping explodes when you saturate upload, you almost certainly have a queue/buffer problem somewhere.

Myth 3: “If it’s slow, it’s my server”

Reality: Your server might be fine; the path might not be.

Routing is policy. Interconnects get congested. Your traffic may traverse different upstreams depending on time, destination, and BGP decisions. That is why two VPS providers in the same city can feel like different planets.

This is governed by BGP and business relationships as much as physics.

When the route is bad, no sysctl will save you. When the route is okay but queues are bloated, shaping plus sane qdiscs can feel like magic.


Measurement notes (what I actually trust)

Ping is a smoke alarm. Useful, but not the whole story. When I want to know what is really happening, I rotate through a few tools:

  • mtr for a living picture of loss, jitter, and route weirdness:

    mtr -rwz 1.1.1.1
    

    If the route changes by time of day, or a hop shows bursts of loss or jitter, you are looking at congestion, peering, or routing more than local tuning.

  • ss to inspect real TCP behavior on a live connection (retransmits, cwnd, pacing hints):

    ss -ti dst 1.1.1.1
    

    If retransmits jump when nothing else changed, the path is sick or overloaded.

  • A basic HTTP latency loop that looks closer to “my app feels slow”:

    while true; do
      curl -o /dev/null -s -w "%{time_total}\n" https://example.com/health
      sleep 0.2
    done
    

    Run it while you saturate upload. If it turns spiky, that is what your users feel.

None of these are perfect. Together they stop you from optimizing vibes.


Ping under load: the only test that matters

Speedtests are a snapshot of throughput. The lived experience is latency under load.

The simplest test:

  1. start a ping
  2. saturate upload/download
  3. watch whether latency explodes
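
Concretely, in two terminals (the ping target and transfer host are placeholders; any bulk transfer works):

ping 1.1.1.1                              # terminal 1: watch latency continuously
scp big-file.bin user@some-host:/tmp/     # terminal 2: saturate the uplink
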
xychart-beta
  title "Ping under load (bufferbloat vs controlled queue)"
  x-axis "time (s)" 0 --> 60
  y-axis "RTT (ms)" 0 --> 300
  line "Before (bloated queues)" [12,12,13,12,13,15,18,20,25,35,55,80,120,160,210,240,260,240,220,200,180,160,140,120,100,85,70,60,55,50,45,40,35,30,28,25,23,22,21,20,19,18,18,17,17,16,16,15,15,14,14,14,13,13,13,13,12,12,12,12,12]
  line "After (shaping + fq + sane CC)" [12,12,12,12,13,13,13,13,13,14,14,14,15,15,15,15,15,15,14,14,14,14,14,14,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,13,12,12,12,12,12]

What you want is not “max throughput”. You want the line to stay flat when the network is busy.


Practical Linux: enable BBR + fq (ship-it defaults)

Check current:

sysctl net.ipv4.tcp_congestion_control
sysctl net.core.default_qdisc
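
It is also worth checking what the kernel actually offers, since on some distros tcp_bbr is a module that is not loaded by default:

sysctl net.ipv4.tcp_available_congestion_control
sudo modprobe tcp_bbr   # if bbr is missing from the list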

Set (temporary):

sudo sysctl -w net.core.default_qdisc=fq
sudo sysctl -w net.ipv4.tcp_congestion_control=bbr

Persist in /etc/sysctl.d/99-network-tuning.conf:

net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr

Apply:

sudo sysctl --system

Verify:

sysctl net.ipv4.tcp_congestion_control net.core.default_qdisc

If BBR is not available on your kernel, keep fq anyway. It still improves the “one flow ruins everything” failure mode.


Field notes: what to optimize depending on where you are

VPS (shared virtual NIC reality)

  • Start with fq + BBR (if the kernel supports it).
  • If you still see spikes, the provider’s host contention and peering may be your bottleneck, not your kernel.
  • “Feels slow” can be noisy neighbors or upstream congestion, not your box.

Bare metal (you own the NIC, you own the truth)

  • fq + BBR is still a great baseline.
  • If you push serious throughput, then you can go deeper (IRQ affinity, NIC queues, offloads), but only after measuring; a starting sketch follows this list.
  • If your traffic is mostly latency-sensitive, focus on keeping queues small, not chasing max throughput.
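
A minimal “look before you touch” for that deeper step, assuming the interface is eth0:

ethtool -l eth0              # how many hardware queues the NIC exposes
grep eth0 /proc/interrupts   # which CPUs service the NIC's interrupts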

Homelab / home edge (the bottleneck is usually your uplink)

  • The biggest win is shaping at about 92 to 97 percent of real rate, paired with AQM (FQ-CoDel/CAKE).
  • MTU/MSS correctness matters a lot if you use tunnels/overlays.
  • If the experience changes dramatically by time of day, suspect congestion/peering at the ISP level.

CGNAT + tunnels

  • Your route is now a product; you are paying in latency for reachability.
  • Treat MTU as guilty until proven innocent.
  • Expect that some protocols/paths behave differently; keep the system simple and observable.

UDP matters (and why TCP tuning still helps)

A lot of the pain people feel (voice, games, video calls) rides on UDP. So it is fair to ask: if I tune TCP, why would that help?

Because most of the real suffering comes from the bottleneck queue, and that queue is shared. When your uplink is saturated and buffers grow, everything waits. TCP packets wait. UDP packets wait. DNS waits. Even if UDP has its own congestion logic at the application layer, it cannot escape a bloated queue.

That is why edge shaping and AQM are so powerful. They improve latency under load for everything, not just TCP.


MTU: the gremlin that makes “some sites stall” and “tunnels feel cursed”

If you use overlays (WireGuard/OpenVPN), PPPoE, cloud tunnels, or anything with extra headers, MTU issues can surface as random stalls. Path MTU Discovery is supposed to handle it; sometimes it does not (ICMP filtering, broken middleboxes, etc.).

Quick test:

ping -M do -s 1472 1.1.1.1

If that ping fails with “message too long”, then 1472 bytes of payload plus 28 bytes of ICMP and IP headers no longer fit in a 1500-byte packet, and the path MTU is smaller. You often fix it by:

  • lowering MTU on tunnel interfaces
  • clamping TCP MSS so endpoints do not send too-large segments

MSS is the max TCP payload per segment. Clamping it is the pragmatic “make it stop breaking” move (especially on gateways).
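
Two sketches of those fixes, assuming a WireGuard interface wg0 and an iptables gateway; the interface name and the 1380 value are placeholders you adapt to your overlay's overhead:

sudo ip link set dev wg0 mtu 1380   # lower the tunnel MTU below the real path MTU

# classic gateway-side clamp: rewrite the MSS in SYNs to fit the discovered path MTU
sudo iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu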


Offloads: do not touch unless you are the router (and you measured)

NIC offloads (TSO/GSO/GRO/LRO) can improve throughput, but in gateway/router roles they can:

  • distort shaping behavior
  • hide real packet behavior
  • create odd latency patterns in virtual switching/bridges/tunnels

On a VPS: usually leave it alone. On a bare-metal gateway: test and measure.

Inspect:

ethtool -k eth0

Test (only if you have a reason):

sudo ethtool -K eth0 gro off gso off tso off

If there is no measurable improvement, revert.


What “done” feels like

After these changes, you do not necessarily see a bigger speedtest number.

You see something more valuable:

  • SSH stays crisp while backups run
  • API p99 stops spiking during uploads
  • calls survive while someone pulls a multi-GB image
  • your homelab becomes boring, stable, predictable

You stop chasing Mbps and start shipping responsiveness.

Once you see the world as queues, once you internalize that the internet is mostly waiting rooms, you stop asking “how fast is it?” and start asking the only question that matters:

How bad does it get when it is busy?


Conclusion

Most of my “network performance” fixes have been boring, almost insulting in how simple they are.

Not exotic sysctls. Not magic NIC flags. Not “optimize the server harder”.

It is usually this: find where the bottleneck queue lives, then take ownership of it.

BBR is about being polite on the server side. Shaping and AQM are about keeping your edge civilized. MTU is about not stepping on rakes.

Everything else is details.

