Networking
Typing a URL or calling an API appears trivial, but beneath this simplicity lies an orchestration of protocols, addressing schemes, data formats, and layered abstractions that ensures data arrives correctly, efficiently, and securely across the globe.
- https://www.youtube.com/watch?v=JDh_lzHO_CA
- https://www.youtube.com/watch?v=p6ASAAMwgd8&t=0
- https://www.youtube.com/watch?v=D26sUZ6DHNQ
- https://www.youtube.com/watch?v=5VAM2QAGZIc
- https://www.youtube.com/watch?v=K9L9YZhEjC0&t=951s
- https://www.youtube.com/watch?v=GAyZ_QgYYYo
I
1.1. Packet Switching
Before computer networks, long-distance communication relied on circuit switching, where telephony systems established a dedicated end-to-end circuit for each call, reserving bandwidth for the entire duration even during silence. A transatlantic phone call in the 1960s occupied a fixed share of a submarine cable whether the parties were speaking or not. This model provided predictable latency and guaranteed bandwidth, but it was fundamentally wasteful for bursty data traffic, where a typical computer sends data in short, irregular bursts separated by long idle periods.
Packet switching, proposed independently by Paul Baran at RAND (1964) and Donald Davies at NPL (1965), broke data into discrete packets that are routed independently through a shared network. The key insight is statistical multiplexing, where many conversations share the same physical links by interleaving their packets, and the aggregate utilisation is far higher than dedicating a circuit to each. Each packet carries its own destination address and is forwarded hop by hop through intermediate nodes (routers), which make independent forwarding decisions based on routing tables. Packets from the same message may take different paths and arrive out of order, leaving reassembly to the destination.
ARPANET, funded by ARPA (later DARPA) and first connected in October 1969 between UCLA and SRI, was the first general-purpose packet-switched network and served as the testbed for the protocols that became the modern internet. Two competing models emerged for structuring these protocols. The OSI model (1984) defined seven layers with strict separation of concerns, designed for theoretical completeness. The TCP/IP model, developed pragmatically through working implementations on ARPANET, collapsed this into four layers (Link, Internet, Transport, Application). TCP/IP prevailed because it shipped working code first and standardised around what worked, rather than designing a complete specification upfront. This pragmatism, favouring simplicity and deployability over architectural purity, continues to shape protocol design today.
1.2. Protocol Stack
The TCP/IP stack organises network communication into four layers, each providing services to the layer above and consuming services from the layer below. The Link layer (Ethernet, Wi-Fi) handles frame transmission over a physical medium between directly connected nodes, dealing with media access control and error detection at the hardware level. The Internet layer (IP) provides logical addressing and best-effort routing, enabling packets to traverse multiple heterogeneous networks from source to destination without either endpoint knowing the physical topology. The Transport layer (TCP, UDP) manages end-to-end communication between processes on different hosts, multiplexing connections via port numbers and optionally providing reliability, ordering, and flow control. The Application layer (HTTP, SSH, DNS, SMTP) implements the semantics of specific services.
Among application protocols, HTTP became dominant largely because the rise of the World Wide Web in the 1990s made it the universal interface for browsers, and this momentum carried forward into APIs, mobile backends, and cloud services. HTTP’s request-response model, human-readable headers, and extensibility through methods (GET, POST, PUT, DELETE) and status codes made it natural for both human-facing websites and machine-to-machine communication. The progression from HTTP/1.0 (one request per TCP connection) through HTTP/1.1 (persistent connections, pipelining) to HTTP/2 (binary framing, multiplexed streams over a single TCP connection) and HTTP/3 (QUIC over UDP) reflects continuous optimisation of the same basic semantics for modern network conditions.
Encapsulation is the mechanism by which each layer wraps the payload from the layer above with its own header (and sometimes trailer). An HTTP request becomes a TCP segment with source and destination port numbers and a 32-bit sequence number, then an IP packet with 32-bit source and destination addresses, then an Ethernet frame with 48-bit MAC addresses and a CRC-32 checksum. At the receiving end, each layer strips its header and passes the payload upward, a process called decapsulation. This layered design enables interoperability, as any application protocol can run over any transport over any network technology without modification.
Each layer defines its own protocol data unit (PDU): frames at the Link layer, packets at the Internet layer, segments (TCP) or datagrams (UDP) at the Transport layer, and messages at the Application layer. The MTU (Maximum Transmission Unit, typically 1500 bytes for Ethernet) constrains the maximum payload at the Link layer. IP packets exceeding the path MTU are either fragmented into smaller packets (IPv4, though this is discouraged due to reassembly overhead and the risk of fragment loss) or dropped with an ICMPv6 “Packet Too Big” response that informs the sender to reduce its packet size, a mechanism called path MTU discovery.
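To make the overhead concrete, here is a small arithmetic sketch of how the common header sizes stack up on a typical Ethernet path, using the 1500-byte MTU and the minimum header sizes mentioned above (options would enlarge the IP and TCP headers):

```python
# A minimal sketch of per-layer overhead on a typical Ethernet path
# (minimum header sizes; IP and TCP options would enlarge them).
ETH_HEADER = 14      # destination MAC, source MAC, EtherType
ETH_FCS = 4          # CRC-32 trailer
IP_HEADER = 20       # IPv4, no options
TCP_HEADER = 20      # no options
MTU = 1500           # Ethernet payload limit (size of the IP packet)

mss = MTU - IP_HEADER - TCP_HEADER
print(f"TCP MSS over Ethernet: {mss} bytes")                      # 1460
print(f"Frame on the wire: {ETH_HEADER + MTU + ETH_FCS} bytes")   # 1518
```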
1.3. Addressing
Network communication requires multiple addressing schemes operating at different layers. IP addresses provide logical identification. IPv4 uses 32-bit addresses (e.g. 192.168.1.1, yielding $2^{32} \approx 4.3 \times 10^9$ possible addresses), which proved insufficient as the internet grew. IPv6 extends this to 128 bits ($2^{128} \approx 3.4 \times 10^{38}$ addresses), enough to assign a unique address to every atom on Earth’s surface many times over. CIDR (Classless Inter-Domain Routing) notation (e.g. 10.0.0.0/8) specifies a network prefix length, dividing an address into a network portion and a host portion and enabling hierarchical address allocation. MAC addresses are 48-bit hardware identifiers assigned to network interface cards, used for local link-layer delivery within a single broadcast domain.
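As a quick illustration, Python’s standard-library ipaddress module makes the network/host split of a CIDR prefix explicit; the 10.0.0.0/8 block is the example used above:

```python
import ipaddress

# A small sketch using the standard-library ipaddress module to inspect
# a CIDR block; the addresses are placeholders.
net = ipaddress.ip_network("10.0.0.0/8")
print(net.prefixlen)          # 8  -> the network portion is the first 8 bits
print(net.num_addresses)      # 16777216 addresses (2**24 host bits)
print(ipaddress.ip_address("10.42.1.7") in net)     # True
print(ipaddress.ip_address("192.168.1.1") in net)   # False

# IPv6 prefixes work the same way.
print(ipaddress.ip_network("2001:db8::/32").num_addresses)   # 2**96
```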
ARP (Address Resolution Protocol) bridges the gap between IP and MAC addresses on a local network segment. When a host needs to send a packet to an IP address on the same subnet, it broadcasts an ARP request (“who has 192.168.1.5?”) to all devices on the segment, and the owner of that IP responds with its MAC address. The result is cached in an ARP table to avoid repeated broadcasts. DHCP (Dynamic Host Configuration Protocol) automates IP address assignment through a four-step exchange known as DORA. The client broadcasts a Discover message, available servers respond with Offers containing proposed IP configurations, the client sends a Request for its preferred offer, and the server confirms with an Acknowledge, granting a time-limited lease on the assigned address.
DNS (Domain Name System) provides a hierarchical, distributed namespace that maps human-readable domain names to IP addresses. The resolution process begins at a recursive resolver (typically provided by the ISP or a public service like 8.8.8.8), which queries the hierarchy on behalf of the client. It first contacts one of the 13 root server clusters (operated by organisations including ICANN, Verisign, and NASA, distributed globally via anycast), which delegate to TLD (top-level domain) servers (.com, .org, .io), which in turn delegate to authoritative nameservers that hold the actual address records. Each response carries a TTL (time-to-live) value that controls how long the answer may be cached, creating a trade-off between resolution latency and propagation speed of DNS changes.
DNS is critical infrastructure because virtually every HTTP request and SSH connection begins with a DNS lookup, and misconfigured or unavailable DNS is a common source of outages in distributed systems. DNS record types extend beyond simple address mapping. A records map names to IPv4 addresses, AAAA records to IPv6, CNAME records create aliases, MX records direct email routing, TXT records carry arbitrary metadata (commonly used for domain verification and SPF/DKIM email authentication), and SRV records specify service locations with port numbers and priority weights. DNSSEC adds cryptographic signatures to DNS responses, preventing cache poisoning attacks where an attacker injects forged records into a resolver’s cache.
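A minimal way to watch this machinery from an application is to ask the operating system’s resolver directly; the sketch below uses the standard getaddrinfo call with example.com as a placeholder hostname:

```python
import socket

# A minimal sketch of name resolution via the OS resolver (which in turn
# talks to the recursive resolver described above).
for family, _, _, _, sockaddr in socket.getaddrinfo(
        "example.com", 443, proto=socket.IPPROTO_TCP):
    kind = "A (IPv4)" if family == socket.AF_INET else "AAAA (IPv6)"
    print(kind, sockaddr[0])
```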
II
2.1. TCP
TCP (Transmission Control Protocol) provides reliable, ordered, byte-stream delivery over an unreliable packet-switched network. Connection establishment uses a three-way handshake. The client sends a SYN segment with an initial sequence number $x$, the server responds with SYN-ACK carrying its own initial sequence number $y$ and acknowledging $x+1$, and the client completes the handshake with ACK $y+1$. This exchange takes one round-trip time (RTT) before data transfer can begin. Connection teardown uses a four-way FIN handshake, after which the initiating side enters TIME_WAIT state for $2 \times$ MSL (Maximum Segment Lifetime), in practice commonly around 60 seconds as on Linux, to ensure late-arriving segments from the old connection are not confused with a new one on the same port.
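From an application’s point of view the whole handshake hides behind a single connect call; the sketch below (placeholder host and port) shows where the SYN exchange and the FIN teardown actually happen:

```python
import socket

# A minimal sketch: connect() performs the SYN / SYN-ACK / ACK exchange
# before returning, and closing the socket starts the FIN teardown.
with socket.create_connection(("example.com", 80), timeout=5) as sock:
    sock.sendall(b"HEAD / HTTP/1.1\r\nHost: example.com\r\n\r\n")
    print(sock.recv(1024).decode(errors="replace"))
# Leaving the with-block closes the socket; the side that closes first
# is the one that passes through TIME_WAIT.
```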
Reliability is achieved through sequence numbers and acknowledgements. Each byte in the stream is numbered, and the receiver sends cumulative ACKs indicating the next expected byte. The sender maintains a sliding window of unacknowledged data and retransmits segments that are not acknowledged within a dynamically computed timeout (the retransmission timeout, or RTO, estimated from smoothed RTT measurements via Jacobson’s algorithm). Fast retransmit optimises this by not waiting for the timeout. When the sender receives three duplicate ACKs for the same sequence number, it infers that the next segment was lost and retransmits immediately, reducing recovery time from RTO (often hundreds of milliseconds) to roughly one RTT.
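The RTO estimator itself is small enough to sketch; the toy implementation below follows the RFC 6298 formulation of Jacobson’s algorithm ($\alpha = 1/8$, $\beta = 1/4$, RTO $=$ SRTT $+ 4 \cdot$ RTTVAR, with a one-second floor) on a few invented RTT samples:

```python
# A toy sketch of the smoothed-RTT / RTO estimator from RFC 6298.
ALPHA, BETA, MIN_RTO = 1 / 8, 1 / 4, 1.0   # seconds

srtt = rttvar = None

def update_rto(rtt_sample):
    """Feed one RTT measurement (seconds) and return the new RTO."""
    global srtt, rttvar
    if srtt is None:                         # first measurement
        srtt, rttvar = rtt_sample, rtt_sample / 2
    else:
        rttvar = (1 - BETA) * rttvar + BETA * abs(srtt - rtt_sample)
        srtt = (1 - ALPHA) * srtt + ALPHA * rtt_sample
    return max(MIN_RTO, srtt + 4 * rttvar)

for sample in (0.100, 0.110, 0.300, 0.105):  # simulated RTTs in seconds
    print(f"sample={sample:.3f}s  ->  RTO={update_rto(sample):.3f}s")
```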
Flow control and congestion control are distinct mechanisms that are often confused. Flow control prevents the sender from overwhelming a slow receiver by using a receiver-advertised window (rwnd) that specifies how many bytes the receiver’s buffer can accept. Congestion control prevents the sender from overwhelming the network by maintaining a congestion window (cwnd) that estimates the network’s capacity. The sender transmits at the rate $\min(\text{rwnd}, \text{cwnd})$, so the bottleneck, whether at the receiver or in the network, always governs the sending rate.
Congestion control algorithms determine how cwnd evolves. Slow start begins with cwnd $= 1$ MSS (Maximum Segment Size) and doubles it each RTT (exponential growth) until reaching a threshold (ssthresh). Beyond the threshold, congestion avoidance increases cwnd by $1/\text{cwnd}$ per ACK (approximately one MSS per RTT, i.e. additive increase). On loss signalled by three duplicate ACKs, cwnd is halved (multiplicative decrease); on loss detected by a timeout, cwnd resets to 1 MSS and slow start begins again. CUBIC, the default in Linux since 2006, replaces the linear additive increase with a cubic function of time since the last congestion event, enabling faster recovery on high-bandwidth links. BBR (Bottleneck Bandwidth and Round-trip propagation time), developed by Google, takes a fundamentally different approach by building an explicit model of the network path, continuously estimating bottleneck bandwidth and minimum RTT, and pacing packets to match the estimated capacity rather than probing with loss.
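The toy simulation below (window sizes in units of MSS, ssthresh chosen arbitrarily) illustrates how cwnd evolves under classic slow start, additive increase, and the two loss reactions just described; CUBIC and BBR replace these rules but keep the same overall structure:

```python
# A toy simulation of classic cwnd evolution; values are in units of MSS.
cwnd, ssthresh = 1.0, 16.0

def on_ack_round():
    """One RTT's worth of ACKs arrives without loss."""
    global cwnd
    cwnd = cwnd * 2 if cwnd < ssthresh else cwnd + 1   # slow start vs additive increase

def on_triple_dup_ack():
    """Fast retransmit: multiplicative decrease (halve the window)."""
    global cwnd, ssthresh
    ssthresh = max(cwnd / 2, 2)
    cwnd = ssthresh

def on_timeout():
    """Retransmission timeout: collapse to 1 MSS and re-enter slow start."""
    global cwnd, ssthresh
    ssthresh = max(cwnd / 2, 2)
    cwnd = 1.0

for rtt in range(8):
    on_ack_round()
    print(f"RTT {rtt}: cwnd = {cwnd:.0f} MSS")
on_triple_dup_ack()
print(f"after triple duplicate ACKs: cwnd = {cwnd:.0f} MSS")
```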
2.2. UDP
UDP (User Datagram Protocol) provides connectionless, best-effort delivery with minimal overhead. Its header is only 8 bytes (source port, destination port, length, checksum), compared to TCP’s minimum 20-byte header, and there is no handshake, no acknowledgement, no retransmission, no ordering guarantee, and no flow or congestion control. Each datagram is independent, and the protocol makes no attempt to recover lost, duplicated, or reordered packets. This simplicity is not a deficiency but a deliberate design choice that makes UDP suitable for applications where the overhead of reliability is more costly than occasional data loss.
DNS queries use UDP because the typical request-response exchange fits in a single datagram, and the one RTT saved by skipping TCP’s three-way handshake matters for a service invoked before every HTTP connection. Real-time voice and video (VoIP, video conferencing) prefer UDP because retransmitting a stale audio frame that arrives 200 ms late is worse than simply dropping it and continuing with fresh data. Online games send frequent state updates (player positions, actions) at 20-60 Hz, where each update supersedes the last, making reliability for individual packets unnecessary. In each case, the application either tolerates loss, implements its own selective reliability on top of UDP, or treats the latest packet as the only one that matters.
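The programming model reflects this simplicity: a UDP sender is a single sendto call with no connection to set up or tear down. The sketch below uses a placeholder address and payload:

```python
import socket

# A minimal sketch of UDP's fire-and-forget model: no handshake, no
# acknowledgement, each sendto() is an independent datagram.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(b"player position update #42", ("203.0.113.10", 9999))
sock.close()   # nothing to tear down: there was never a connection
```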
Without built-in congestion control, a UDP application that sends at full rate regardless of network conditions risks contributing to congestion collapse, where the network is saturated with retransmissions and useful throughput approaches zero. Well-behaved UDP applications implement their own rate limiting or congestion-aware pacing. Packets exceeding the path MTU (typically 1500 bytes minus IP and UDP headers, leaving ~1472 bytes of payload) are fragmented at the IP layer in IPv4, but IP fragmentation is unreliable because the loss of any single fragment forces retransmission of the entire datagram. Applications should instead perform path MTU discovery and keep datagrams within the MTU, which is why protocols like DNS fall back to TCP for responses exceeding 512 bytes (the traditional safe UDP payload size).
2.3. Modern Transport
QUIC, developed by Google and standardised as the transport for HTTP/3, runs over UDP and provides TCP-like reliability with built-in TLS 1.3 encryption. Its central innovation is multiplexed streams without head-of-line blocking. In HTTP/2 over TCP, a single lost packet stalls all streams on that TCP connection because TCP enforces in-order delivery of the byte stream. QUIC implements independent streams at the transport layer, so a lost packet in one stream does not block data delivery on other streams. Connection establishment combines the transport handshake and TLS handshake into a single round-trip, and for resumed connections with cached credentials, QUIC supports 0-RTT data, sending application data in the very first packet.
RDMA (Remote Direct Memory Access) bypasses the operating system kernel entirely, allowing one machine’s NIC to read from or write to another machine’s memory without involving the CPU or kernel network stack on either side. This achieves single-digit microsecond latencies and near-wire-speed throughput, compared to tens of microseconds for kernel-mediated TCP. InfiniBand, a switched-fabric interconnect, provides native RDMA with bandwidth from SDR (10 Gbps) to NDR (400 Gbps). RoCE (RDMA over Converged Ethernet) brings RDMA to Ethernet networks, enabling RDMA in existing data centres without InfiniBand hardware. In large-scale ML training clusters, RDMA is essential for gradient synchronisation across GPUs, where the latency and CPU overhead of TCP would leave GPUs idle waiting for communication.
Network performance is characterised by latency (time for a single packet to traverse the path) and throughput (data volume per unit time), and the product of these two quantities determines the optimal operating point. The bandwidth-delay product ($\text{BDP} = \text{bandwidth} \times \text{RTT}$) represents the volume of data “in flight” needed to fully utilise a link. For a 10 Gbps link with 10 ms RTT, BDP $= 12.5$ MB, meaning the TCP window must be at least this large. Socket buffer sizes (SO_SNDBUF, SO_RCVBUF) should be tuned to match BDP for high-throughput transfers. Conversely, the TCP_NODELAY option disables Nagle’s algorithm (which batches small writes into larger segments to reduce header overhead) and is essential for latency-sensitive protocols like SSH interactive sessions and RPC calls where every keystroke or request should be sent immediately.
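A short sketch of the BDP arithmetic above, together with the socket options it motivates (standard POSIX/Linux option names; the kernel may clamp buffer sizes to its configured maximums):

```python
import socket

# Bandwidth-delay product for the 10 Gbps / 10 ms example from the text.
bandwidth_bps = 10e9          # 10 Gbps link
rtt_s = 0.010                 # 10 ms round trip
bdp_bytes = bandwidth_bps * rtt_s / 8
print(f"BDP = {bdp_bytes / 1e6:.1f} MB")   # 12.5 MB of data in flight

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Size the buffers toward the BDP for bulk transfers (the kernel may
# clamp these, e.g. to net.core.rmem_max / wmem_max on Linux).
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, int(bdp_bytes))
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, int(bdp_bytes))
# For latency-sensitive traffic, disable Nagle's algorithm instead.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
```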
III
3.1. TLS and SSH
TLS (Transport Layer Security) provides encryption, authentication, and integrity for TCP connections. In TLS 1.3 (2018), the handshake begins with the client sending a ClientHello containing supported cipher suites and key shares (pre-computed Diffie-Hellman parameters). The server responds with a ServerHello selecting the cipher suite and providing its own key share, followed in the same flight by its X.509 certificate (sent encrypted under the freshly derived handshake keys). Both sides derive the shared secret from the ECDHE (Elliptic Curve Diffie-Hellman Ephemeral) exchange, and the handshake completes in a single round-trip. For resumed connections where the client has cached a pre-shared key from a previous session, TLS 1.3 supports 0-RTT data, sending encrypted application data in the very first flight, though at the cost of replay vulnerability for non-idempotent requests.
The server’s identity is verified through a certificate chain of trust. The server presents a certificate signed by an intermediate CA (Certificate Authority), which is itself signed by a root CA whose certificate is pre-installed in the client’s trust store (browsers and operating systems ship with ~100-150 trusted root certificates). The client validates each signature in the chain and checks that the certificate’s Common Name or Subject Alternative Name matches the requested hostname. Common errors like SSL: CERTIFICATE_VERIFY_FAILED or ERR_CERT_AUTHORITY_INVALID typically indicate expired certificates, hostname mismatches, or missing intermediate certificates that break the chain. Tools like curl -v and openssl s_client reveal the full handshake and certificate details for debugging.
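For a programmatic view of the same verification, Python’s standard-library ssl module performs chain and hostname validation by default; the sketch below (placeholder hostname) prints the negotiated version and the peer certificate’s subject and expiry:

```python
import socket
import ssl

# A minimal sketch of certificate verification with the standard-library
# ssl module; the default context loads the system trust store and
# checks the hostname against the certificate's SAN.
hostname = "example.com"
ctx = ssl.create_default_context()
with socket.create_connection((hostname, 443), timeout=5) as raw:
    with ctx.wrap_socket(raw, server_hostname=hostname) as tls:
        print(tls.version())                  # e.g. 'TLSv1.3'
        cert = tls.getpeercert()
        print(cert["subject"], cert["notAfter"])
# A broken chain or hostname mismatch raises ssl.SSLCertVerificationError,
# the Python counterpart of CERTIFICATE_VERIFY_FAILED.
```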
TLS achieves forward secrecy through the use of ephemeral key exchange parameters. Even if an attacker obtains the server’s long-term private key (used only to authenticate the server via its certificate), they cannot decrypt previously recorded sessions because each session used a unique ephemeral Diffie-Hellman key pair that was discarded after the handshake. This is why TLS 1.3 mandates ephemeral key exchange and removes support for static RSA key exchange, where the same server key was used to encrypt the session key and a compromised key could decrypt all past traffic.
SSH (Secure Shell) provides encrypted remote shell access and secure tunnelling, using a similar cryptographic foundation to TLS but with a different trust model. Rather than certificate authorities, SSH uses public key authentication where the client proves possession of a private key corresponding to a public key stored in the server’s ~/.ssh/authorized_keys. SSH tunnels (-L for local forwarding, -R for remote forwarding, -D for dynamic SOCKS proxy) create encrypted channels through which arbitrary TCP traffic can be forwarded, enabling access to services behind firewalls or NAT. Agent forwarding allows credentials to pass through intermediate bastion hosts (hardened jump servers that serve as the single entry point to private networks) without copying private keys to untrusted machines, which is essential for accessing GPU training clusters behind corporate firewalls.
3.2. Cloud Networking
Cloud providers organise networking into VPCs (Virtual Private Clouds), logically isolated networks within a provider’s infrastructure that give each tenant control over IP address ranges, subnets (partitions of the VPC mapped to availability zones), route tables (rules determining where packets are forwarded), and internet gateways (allowing outbound traffic to the public internet). Instances in private subnets communicate with external services through NAT gateways that translate private source IPs to a public IP, while instances in public subnets receive directly routable addresses. Security groups act as stateful virtual firewalls attached to instances, filtering traffic by protocol, port, and source/destination, while network ACLs provide stateless filtering at the subnet level.
Container networking builds on Linux kernel primitives. Each container receives its own network namespace with isolated interfaces, routing tables, and iptables rules. A veth (virtual Ethernet) pair connects the container’s namespace to a bridge on the host, with one end inside the container and the other attached to the bridge. NAT rules (via iptables) translate between the container’s private IP and the host’s external IP for outbound traffic. For multi-host communication, overlay networks encapsulate layer-2 frames within UDP packets using VXLAN, creating virtual network segments that span physical host boundaries and give each container a cluster-wide routable IP without manual NAT configuration.
Kubernetes extends container networking through CNI (Container Network Interface) plugins that implement the cluster’s networking model. The model requires that every pod receives a unique IP address and that any pod can communicate with any other pod without NAT, creating a flat network. Calico uses BGP to advertise pod routes across nodes, avoiding encapsulation overhead. Flannel uses VXLAN overlays for simplicity. Cilium leverages eBPF to implement networking, security, and observability directly in the kernel. Services provide stable virtual IPs that load-balance across ephemeral pod endpoints, with L4 load balancers forwarding TCP connections by IP and port and L7 load balancers routing HTTP requests by URL path, headers, or cookies.
At the application layer, the Model Context Protocol (MCP), open-sourced by Anthropic in November 2024, defines a standard for connecting AI applications to external data sources and tools. MCP follows a client-server architecture over JSON-RPC 2.0, reusing the message-flow design of the Language Server Protocol (LSP). Two transport mechanisms are supported, stdio for local servers running as subprocesses (low latency, no network overhead) and HTTP with SSE for remote servers accessible over the network. MCP servers expose resources (contextual data), tools (executable functions), and prompts (templated interactions) through a capability negotiation handshake, enabling AI agents to discover and invoke external services through a uniform protocol rather than ad-hoc integrations.
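On the wire, an MCP exchange is ordinary JSON-RPC 2.0; the sketch below shows the shape of a request (the tools/list method name follows the MCP specification, but treat the exact fields as illustrative):

```python
import json

# A hedged sketch of what an MCP request looks like on the wire:
# plain JSON-RPC 2.0.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/list",
    "params": {},
}
# Over the stdio transport this line is written to the server's stdin;
# over HTTP with SSE it forms the body of a POST request.
print(json.dumps(request))
```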
3.3. Distributed ML Communication
PyTorch’s torch.distributed package supports multiple communication backends. NCCL (NVIDIA Collective Communications Library) is optimised for GPU-to-GPU transfers, detecting the physical topology (NVLink between GPUs on the same node, PCIe within a node, InfiniBand or RoCE across nodes) and selecting communication paths accordingly. Gloo provides CPU-based and heterogeneous communication over TCP/IP, suitable for smaller-scale training or environments without InfiniBand. MPI (Message Passing Interface) targets traditional HPC clusters with vendor-optimised implementations. Multi-GPU training on a single node typically uses shared memory or NVLink, while multi-node training communicates over RDMA (InfiniBand) or TCP, with the choice of backend significantly affecting training throughput.
The AllReduce operation, which computes a reduction (typically summation) of gradients across all $N$ workers and distributes the result to every worker, is the primary communication bottleneck in data-parallel training. The naive approach of gathering all gradients at one node and broadcasting the result back requires $O(N)$ data transfers through a single bottleneck. Ring-allreduce arranges workers in a logical ring and completes the operation in $2(N-1)$ steps, transferring a total of $\frac{2(N-1)}{N}$ times the gradient buffer size per worker, which asymptotically approaches $2\times$ the model size regardless of worker count. NCCL auto-detects the GPU topology and selects between ring-allreduce, tree-allreduce (lower latency for small messages due to $O(\log N)$ depth), and hybrid algorithms depending on message size and interconnect characteristics.
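A minimal data-parallel sketch with torch.distributed, assuming the processes were launched by torchrun (or an equivalent launcher) so that rank, world size, local rank, and the master address are already set in the environment:

```python
import os

import torch
import torch.distributed as dist

# A minimal sketch of a gradient AllReduce with torch.distributed.
# Assumes torchrun (or equivalent) has set RANK, WORLD_SIZE, LOCAL_RANK,
# MASTER_ADDR and MASTER_PORT in the environment.
dist.init_process_group(backend="nccl")            # "gloo" on CPU-only hosts
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

rank = dist.get_rank()
grad = torch.ones(1024, device="cuda") * rank      # stand-in for a gradient buffer
dist.all_reduce(grad, op=dist.ReduceOp.SUM)        # ring/tree algorithm chosen by NCCL
grad /= dist.get_world_size()                      # average the gradients

dist.destroy_process_group()
```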
In elastic training environments where workers may join or leave during a run, a rendezvous mechanism coordinates worker discovery. PyTorch Elastic’s --rdzv_backend supports multiple backends (etcd, c10d, or static file) for this coordination. Workers register with the rendezvous endpoint, wait until the expected number of participants have joined (or a timeout triggers rescaling), and exchange connection information including hostnames, ports, and rank assignments. The rendezvous endpoint must be reachable by all workers, and hostname and IP consistency across VMs and containers is critical, as mismatches between /etc/hostname, DNS records, and the addresses used for NCCL communication are a common source of failures.
Debugging distributed training networking failures requires systematic diagnosis because symptoms (hangs, timeouts, slow steps) manifest far from the root cause. Setting NCCL_DEBUG=INFO logs the topology detection and communication path selection, revealing whether NCCL is using the expected interconnect. ss and netstat display active connections and their states, identifying port conflicts or connections stuck in CLOSE_WAIT. tcpdump and Wireshark capture packets on the wire for protocol-level diagnosis. Common failure modes include DNS resolution failures (workers cannot find each other), port binding conflicts (MASTER_PORT already in use), firewall rules blocking NCCL traffic on non-standard ports, and asymmetric network bandwidth causing stragglers that slow the entire collective operation.
(C:)
I gathered words solely for my own purposes without any intention to break the rigour of the subjects.
Well, I prefer eating corn in a spiral.