
Scaling the Load Balancer: From Single Process to Distributed State with Redis

Introduction

This is the third post in my simple load balancer saga. I’ve covered uvicorn/starlette and pure asyncio implementations so far. Both share the same limitation: they do not work according to the spec when running as multiple processes or on multiple instances. That’s because the spec says an upstream can be registered at any time, and all processes/instances would need to synchronize state in order to fulfill the round-robin requirement.

The original specification is silent on scaling, so in my opinion, one should not bother with implementing a distributed solution within the limited time given for coding. On the other hand, the scaling issue is one of the more obvious questions that could be asked after presenting a single-process solution. In this post, I’m going to show how to extend the naive, single-process solution by moving the configuration state out of the process memory into a shared control plane.

Design space

There are a couple of things to note about the initial requirements of the spec:

  1. An upstream can be registered at any time;
  2. All instances should be aware of all upstreams so that round-robin order can be fulfilled.

Let’s also assume that the load balancer should be as transparent as possible to the upstream servers—it should not introduce too much latency, for example. Any locking mechanism will introduce additional latency, though a lock on read would be much worse than a lock on write. Additionally, if each instance is going to fetch the list of targets for each incoming request, it is going to be inefficient and could render the shared storage a bottleneck.

The load balancer must be able to scale out to many machines. Therefore, when I refer to the LB process, I implicitly mean that it can be placed on the same or a different machine.

There are solutions for state synchronization with strong consistency guarantees, e.g., Apache ZooKeeper or etcd. However, I think eventual consistency is enough to comply with the specification. Simply put, we don’t need linearizability for a load balancer; if one LB instance is 200ms behind another regarding a new upstream, nobody dies. By settling for eventual consistency, the more complicated setup involving ZooKeeper/etcd can be avoided.

Here is a diagram representing the overall request flow:

flowchart TD
%% --- STYLING DEFINITIONS ---

%% Client: Cool Gray (Matches System Operator)
    classDef client fill:#90A4AE,stroke:#455A64,stroke-width:2px,color:#000;

%% DNS: Sky Blue (Matches Redis/Infrastructure)
    classDef dns fill:#4FC3F7,stroke:#0277BD,stroke-width:2px,color:#000;

%% Load Balancers: Saturated Orange (Matches LB Cluster)
    classDef lb fill:#FFB74D,stroke:#E65100,stroke-width:2px,color:#000;

%% Upstreams: Light Green (New color for Targets, high visibility)
    classDef upstream fill:#AED581,stroke:#558B2F,stroke-width:2px,color:#000;

%% --- NODES ---

    A[Clients]:::client
    B[DNS Server]:::dns

    C[LoadBalancer1\n172.253.253.2]:::lb
    D[LoadBalancer2\n172.253.253.3]:::lb

    E[Upstream1\n172.253.253.20]:::upstream
    F[Upstream2\n172.253.253.21]:::upstream

%% --- CONNECTIONS ---

    A -->|DNS Query| B
    B -->|Resolved to| C
    B -->|Resolved to| D
    C -->|Forward Request| E
    C -->|Forward Request| F
    D -->|Forward Request| E
    D -->|Forward Request| F

%% --- LINK STYLING ---

%% Link 0 (Client -> DNS): Medium Purple (External Request)
    linkStyle 0 stroke:#AB47BC,stroke-width:3px;

%% Links 1,2 (DNS -> LB): Deep Orange (Routing to LB)
    linkStyle 1,2 stroke:#F57C00,stroke-width:3px;

%% Links 3-6 (LB -> Upstream): Medium Green (Traffic to Target)
    linkStyle 3,4,5,6 stroke:#2E7D32,stroke-width:3px;

First, we will leverage DNS A records in order to have multiple instances of load balancers. The HTTP client resolves the DNS name and picks one of the load balancers’ IPs in round-robin order. The client then connects to that load-balancer IP and makes an HTTP request. The load balancer inspects the HTTP headers, rewrites them, and forwards the request to one of the upstream servers, again in round-robin fashion.

Redis to the rescue

Redis has a nice feature called keyspace notifications, which allows clients to subscribe to Pub/Sub channels and receive events affecting the Redis data set.

Let’s suppose that we keep a list of upstream servers in Redis. Whenever a new upstream server gets registered, all load-balancer processes will get notified via a keyspace notification. Each process will then pull the list of upstream servers and update its internal state. This way, we get rid of the problem of fetching the list of upstream servers for each request.

What about writing to Redis? Can we avoid locks? Let’s look at our data model first:

upstreams = {
    '/path1': {'8.8.8.8:80', '10.10.10.10:80'},
    '/path2': {'9.9.9.9:80', '10.10.10.10:80'}
}

For each path, e.g., /path1 and /path2, we keep a set of upstream servers, each defined as an IP concatenated with a port. Redis sets are a good fit here: the set is keyed by path, like path1, and contains the upstream servers for that path as IP:PORT strings. The example upstream data translated to Redis commands looks like this:

> SADD path1 8.8.8.8:80
> SADD path1 10.10.10.10:80
> SADD path2 9.9.9.9:80
> SADD path2 10.10.10.10:80
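Once each LB process mirrors this state in memory, picking the next target per path is a purely local operation. Here is a minimal sketch of that bookkeeping (RoundRobinState is an illustrative name, not from the repo; the cycle is simply rebuilt whenever a path’s target set changes):

```python
from itertools import cycle


class RoundRobinState:
    """Keeps a per-path round-robin iterator over the locally cached upstream sets."""

    def __init__(self):
        self._targets = {}  # path -> sorted list of "IP:PORT" strings
        self._cycles = {}   # path -> itertools.cycle over that list

    def update(self, path, targets):
        # Called on every (re)sync from Redis; sorting makes the order deterministic
        # across LB instances that hold the same set.
        self._targets[path] = sorted(targets)
        self._cycles[path] = cycle(self._targets[path])

    def next_target(self, path):
        if path not in self._cycles:
            raise KeyError(f"no upstreams registered for {path}")
        return next(self._cycles[path])


state = RoundRobinState()
state.update("/path1", {"8.8.8.8:80", "10.10.10.10:80"})
print(state.next_target("/path1"))  # "10.10.10.10:80" (lexicographic order starts the cycle)
```

Note that rebuilding the cycle on every update resets the rotation position; for a tiny target list that bias is negligible and keeps the code trivially correct.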

When the load balancer initializes, it will fetch all paths and upstream servers for each path. Then, it will subscribe to the __keyspace@0__:* channel to get notified about changes in upstreams. Whenever a new upstream server is registered, it will get notified and update its internal state. Such a message will look like this:

{'type': 'pmessage', 'pattern': b'__keyspace@0__:*', 'channel': b'__keyspace@0__:paths', 'data': b'sadd'}
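Turning such a message into a (key, operation) pair is a one-liner worth spelling out; a sketch, with the field names taken from the redis-py message dict shown above:

```python
def parse_keyspace_event(message):
    """Extract the touched Redis key and the command name from a pmessage dict.

    The channel has the form b'__keyspace@<db>__:<key>' and the payload is
    the name of the command that modified the key (e.g. b'sadd').
    """
    channel = message["channel"].decode()
    # Everything after the first ':' is the key name.
    _, _, key = channel.partition(":")
    op = message["data"].decode()
    return key, op


msg = {"type": "pmessage", "pattern": b"__keyspace@0__:*",
       "channel": b"__keyspace@0__:paths", "data": b"sadd"}
print(parse_keyspace_event(msg))  # ('paths', 'sadd')
```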

The cool thing is that we can bind to a specific Redis database and even listen to events for specific keys. This way, we can avoid handling unnecessary notifications. In fact, let’s upgrade our data model to use a separate slb_ namespace for paths and upstreams.

First, let’s observe what happens in our keyspace using redis-cli:

redis-cli config set notify-keyspace-events KEA
redis-cli --csv psubscribe '__keyspace@0__:slb_*'
Reading messages... (press Ctrl-C to quit)
"psubscribe","__keyspace@0__:slb_*",1

Then let’s register new upstream servers:

127.0.0.1:6379> SADD slb_path1 8.8.8.8:80
(integer) 1
127.0.0.1:6379> SADD slb_path1 10.10.10.10:80
(integer) 1
127.0.0.1:6379> SADD slb_path2 9.9.9.9:80
(integer) 1
127.0.0.1:6379> SADD slb_path2 10.10.10.10:80
(integer) 1

We should see the following messages in the redis-cli output:

"pmessage","__keyspace@0__:slb_*","__keyspace@0__:slb_path1","sadd"
"pmessage","__keyspace@0__:slb_*","__keyspace@0__:slb_path1","sadd"
"pmessage","__keyspace@0__:slb_*","__keyspace@0__:slb_path2","sadd"
"pmessage","__keyspace@0__:slb_*","__keyspace@0__:slb_path2","sadd"

Let’s verify what targets are defined for each path:

127.0.0.1:6379> SMEMBERS slb_path1
1) "8.8.8.8:80"
2) "10.10.10.10:80"
127.0.0.1:6379> SMEMBERS slb_path2
1) "10.10.10.10:80"
2) "9.9.9.9:80"

Therefore, Redis can act as our Control Plane. By decoupling the routing state (stored in Redis) from the request processing, we allow the Data Plane to scale statelessly while the configuration changes dynamically. The Pub/Sub logic will be the synchronization channel between the Control Plane and the Data Plane. Here is a diagram representing the Control Plane we can build using Redis:

flowchart TD
%% --- STYLING DEFINITIONS ---

%% System Operator: Cool Gray/Blue
    classDef actor fill:#90A4AE,stroke:#455A64,stroke-width:2px,color:#000;

%% Load Balancers: Saturated Orange
    classDef lb fill:#FFB74D,stroke:#E65100,stroke-width:2px,color:#000;

%% Redis: Sky Blue
    classDef redis fill:#4FC3F7,stroke:#0277BD,stroke-width:2px,color:#000;

%% Nodes
    SysOp[System Operator]:::actor

    subgraph Cluster ["Load Balancer Cluster"]
        direction TB
    %% Force subgraph border to be visible on both (Medium Gray)
        style Cluster fill:transparent,stroke:#9E9E9E,stroke-width:2px,color:#9E9E9E

        LB1[LB Instance 1\nReceives API Call]:::lb
        LB2[LB Instance 2\nPeer]:::lb
    end

    subgraph Store ["Shared State"]
        style Store fill:transparent,stroke:#9E9E9E,stroke-width:2px,color:#9E9E9E
        Redis[("Redis\nKey: slb_path1")]:::redis
    end

%% --- CONNECTIONS ---

%% 1. API Call (Purple)
    SysOp -- "1. POST /register\n{path: /p1, target: 10.0.0.1}" --> LB1

%% 2. Persistence (Darker Orange)
    LB1 -- "2. Update State\n(SADD slb_path1 ...)" --> Redis

%% 3. Signal (Teal/Cyan)
    Redis -.-> |"3. Pub/Sub Signal\n(Key Modified)"| LB1
    Redis -.-> |"3. Pub/Sub Signal\n(Key Modified)"| LB2

%% 4. Sync (Green)
    LB1 -- "4. Sync State\n(SMEMBERS)" --> Redis
    LB2 -- "4. Sync State\n(SMEMBERS)" --> Redis

%% --- LINK STYLING ---
%% Removed 'color' property, keeping only 'stroke' and 'stroke-width'

%% Link 0 (API): Medium Purple
    linkStyle 0 stroke:#AB47BC,stroke-width:3px;

%% Link 1 (Write): Deep Orange
    linkStyle 1 stroke:#F57C00,stroke-width:3px;

%% Link 2,3 (Signal): Cyan (Dashed)
    linkStyle 2,3 stroke:#00838F,stroke-width:3px,stroke-dasharray: 5 5;

%% Link 4,5 (Sync): Medium Green
    linkStyle 4,5 stroke:#2E7D32,stroke-width:3px;

State convergence and race conditions

How does the load balancer keep its internal state in sync with Redis?

  1. On startup (or reconnection), the LB performs a Full Sync (SMEMBERS);
  2. Then it enters Incremental Sync (Pub/Sub); SADD notifications trigger the fetching of updated targets for the affected path;
  3. If the Pub/Sub connection drops, it must perform a Full Sync again upon reconnection.

sequenceDiagram
%% --- STYLING STRATEGY ---
%% We use RGBA with low opacity (0.15 - 0.2). 
%% This allows the underlying background (White or Black) to bleed through.
%% Result: Soft pastel on White mode, Subtle dark tint on Dark mode.

%% 1. OPERATOR COLUMN (Cool Blue-Grey)
    box rgba(84, 110, 122, 0.2) System Operator
        participant O as Operator
    end

%% 2. LB CLUSTER COLUMN (Deep Orange tint)
    box rgba(239, 108, 0, 0.15) LB Cluster
        participant LB1 as LB (Writer)
        participant LB2 as LB (Peer)
    end

%% 3. REDIS COLUMN (Deep Blue tint)
    box rgba(2, 119, 189, 0.15) Shared State
        participant R as Redis
    end

%% --- PHASE 1: API & PERSISTENCE (Purple tint) ---
    rect rgba(142, 36, 170, 0.15)
        Note over O, LB1: 1. The API Call
        O->>LB1: POST /register {path: /p1, target: 10.0...}

        Note over LB1, R: 2. The Write
        LB1->>R: SADD slb_path1 10.0.0.1:80
        R-->>LB1: (OK)
        LB1-->>O: 200 OK
    end

%% --- PHASE 2: SIGNAL (Cyan tint) ---
    rect rgba(0, 131, 143, 0.15)
        Note over R, LB2: 3. The Signal (Push)
        R->>LB2: PUBLISH "sadd" (Event)
        Note over LB2: Event Received
    end

%% --- PHASE 3: SYNC (Green tint) ---
    rect rgba(46, 125, 50, 0.15)
        Note over LB2, R: 4. The Synchronization (Pull)
        LB2->>R: SMEMBERS slb_path1
        R-->>LB2: ["10.0.0.1:80", ...]
        Note over LB2: Update Local Memory
    end
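The Full Sync step can be sketched in a few lines. This assumes redis-py’s asyncio client (or anything exposing scan_iter() and smembers()) and the slb_ key prefix from above; with decode_responses=True the keys come back as strings:

```python
import asyncio

PREFIX = "slb_"


async def full_sync(redis):
    """Rebuild the whole path -> targets mapping from Redis.

    `redis` is any client exposing scan_iter() and smembers(), e.g.
    redis.asyncio.Redis(decode_responses=True). SCAN is used instead of
    KEYS so a large keyspace doesn't block the server.
    """
    upstreams = {}
    async for key in redis.scan_iter(match=PREFIX + "*"):
        path = "/" + key[len(PREFIX):]  # slb_path1 -> /path1
        upstreams[path] = set(await redis.smembers(key))
    return upstreams
```

The same function serves both startup and the post-reconnection catch-up, which keeps the convergence story simple: there is exactly one way to rebuild state.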

What about the atomicity of operations? Don’t we need locks? There are two possible operations:

  1. Writing, when a new path or a new target is added;
  2. Reading, when we fetch all paths and targets for each path.

Let’s look at the first operation. The SADD command is commutative 1 and atomic. If a target is already in the set, it will not be added again. Even if two processes try to add a new target at the same time, both will succeed without any need for conflict resolution or overwrites. This property of Redis sets protects us from concurrent writes clobbering each other’s data, the classic last-write-wins problem.

The second operation, the reading of targets for each path, is not atomic. That’s because we wait for a notification (essentially, a read via PSUBSCRIBE) and then perform another read for the target key via SMEMBERS, so it involves two read operations before the state is updated on any load balancer instance. If the state changes between the PSUBSCRIBE read and the SMEMBERS read, it’s still fine. Two coroutines could get triggered at some LB instance, and the LB will eventually get a consistent view of the data. This holds even if the order of coroutines gets mixed up due to delays in the network stack.

All of the above is true if we don’t have to deal with removing targets. But what if we have to? After all, it’s a pretty common operation in the real world, but it’s not part of the specification. Let’s see how we can handle this.

Redis has the SREM command which removes a member from the set. It’s also an atomic and commutative operation. One might worry about race conditions: what if an ‘Add’ and ‘Remove’ happen simultaneously? Because we treat notifications merely as ‘dirty flag’ triggers—always fetching the full state via SMEMBERS—we avoid the complexity of applying deltas. As long as our listener loop processes these ‘refresh signals’ sequentially (waiting for the state update to finish before processing the next notification), the load balancer will always converge to the correct final state held by Redis. Here is the crucial snippet from the listener loop where callbacks are processed sequentially:

    async def _listen(self):
        """Process keyspace notifications one at a time so state updates stay ordered."""
        self.logger.info("Starting main loop")
        async for pubsub_message in self._get_message():
            try:
                op = RedisKeyspaceCommand(pubsub_message.data)
                callback = self.callbacks.get(op)
                if callback:
                    # Await the callback before pulling the next notification:
                    # this serializes the refreshes and guarantees convergence.
                    await callback(pubsub_message)
            except ValueError:
                # Ignore notifications for commands we don't track.
                continue
            except Exception:
                self.logger.exception("Unhandled error in callback")
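To see why sequential processing is all we need, here is a hypothetical, stripped-down version of that loop fed from an in-memory event stream instead of Redis (refresh_loop and fetch_targets are illustrative names, not from the repo):

```python
import asyncio


async def refresh_loop(events, fetch_targets, state):
    """Consume 'dirty flag' notifications strictly one at a time.

    `events` yields (key, operation) tuples and stands in for the Pub/Sub
    stream; `fetch_targets` plays the role of SMEMBERS. Because each refresh
    is awaited before the next notification is read, the local state always
    converges to whatever the source of truth currently holds, no matter how
    add/remove events interleave.
    """
    async for key, op in events:
        if op in ("sadd", "srem"):
            state[key] = await fetch_targets(key)
```

Even if a notification arrives late, the corresponding SMEMBERS fetch returns the then-current set, so stale intermediate views are overwritten rather than accumulated.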

Failure modes: what if Redis dies?

A critical requirement for network infrastructure is to “fail open.” If our centralized Redis store goes offline, our load balancers must not stop routing traffic. By caching the upstream list in local process memory, we isolate the data plane from Control Plane failures. If Redis becomes unreachable, the psubscribe loop disconnects; the LB logs the error and attempts to reconnect in the background. Crucially, it continues to serve traffic using the last known list of upstreams. While the system loses the ability to update (add/remove nodes) during the outage, the data plane remains operational.
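The fail-open behaviour hinges on the reconnect loop running independently of request handling. A sketch of that supervisor, assuming connect/full_sync/listen coroutines exist in the LB (the names are hypothetical):

```python
import asyncio
import logging

logger = logging.getLogger("slb")


async def supervise(connect, full_sync, listen,
                    initial_backoff=0.5, max_backoff=30.0):
    """Keep the control-plane link alive without ever blocking the data plane.

    On every (re)connection we do a Full Sync to catch up on missed events,
    then block in `listen` until the link drops. Between failed attempts we
    back off exponentially; the cached upstream list keeps serving traffic.
    """
    backoff = initial_backoff
    while True:
        try:
            conn = await connect()
            await full_sync(conn)      # re-read everything we may have missed
            backoff = initial_backoff  # healthy again, reset the backoff
            await listen(conn)         # returns/raises only when the link drops
        except ConnectionError as exc:
            logger.warning("control plane unreachable (%s); retry in %.1fs",
                           exc, backoff)
            await asyncio.sleep(backoff)
            backoff = min(backoff * 2, max_backoff)
```

This task would be started alongside the HTTP server (e.g. via asyncio.create_task), so a Redis outage only ever stalls the synchronization coroutine, never the proxying path.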

Demo and implementation

Here is the code for the proof-of-concept implementation based on this blog post: https://github.com/b1r3k/simple-distributed-lb

You can run the demo using Docker Compose. The demo setup contains a DNS server, a Redis server, two load balancers, and two upstream servers. When an upstream server starts, it registers itself to the load balancer using the e2e-upstream.sh script. When both upstream servers are started, the e2e tests are executed using the e2e-test.sh script, which sends requests to the load balancer and checks if requests are served according to the round-robin scheme.

$ docker compose up

Further considerations

Availability

For a production environment, we would need to ensure Redis isn’t a single point of failure. We could use Redis Sentinel to provide high availability. Redis Cluster could also be used, but since the list of IPs will be rather tiny, sharding is unnecessary, and the broadcast nature of Pub/Sub in Redis Cluster can generate excessive internal network traffic.

Liveness

The specification does not mention health checks for upstream servers. In a production environment, we would need to implement health checks to at least detect dead upstreams, if not avoid forwarding requests to them. This could be done by having each load-balancer instance periodically check the health of upstreams and report it via Prometheus metrics or similar.
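As a starting point, a per-instance probe can be as simple as a TCP connect with a timeout; tcp_alive below is a hypothetical helper, not from the repo, and an HTTP-level check against a real endpoint would of course be stricter:

```python
import asyncio


async def tcp_alive(target, timeout=1.0):
    """Best-effort liveness probe: can we open a TCP connection to IP:PORT?"""
    host, _, port = target.rpartition(":")
    try:
        _, writer = await asyncio.wait_for(
            asyncio.open_connection(host, int(port)), timeout)
    except (OSError, asyncio.TimeoutError):
        return False
    writer.close()
    await writer.wait_closed()
    return True
```

Each LB instance could run this periodically over its cached target list and export the results as Prometheus gauges, without the probe ever touching the request path.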

Another thing to consider is the liveness of the load-balancer instances themselves. If one instance goes down, the DNS server should stop returning its IP address to clients. This can be done by using a DNS server with health checks.

Drawbacks of relying on DNS for load-balancer discovery

Do you know the classic haiku about DNS?
  1. First and foremost, the problem is DNS caching, both at resolvers and on the client side. This leads to an uneven distribution of load between load balancers; some will be used more often than others. For the same reason, it’s hard to remove a load balancer from the list of available ones, since clients can still have the old DNS record in a cache.
  2. There is also the curious case of RFC 3484, which affects DNS resolution by prioritizing addresses that share the most prefix bits with the source address (section 6, rule 9). Under that rule, the selection of an address from multiple A records is no longer random; instead, the destination address sharing the longest common prefix with the source address is selected, presumably on the basis that it’s in some sense “closer” in the network. That was the case for Windows Vista, for example, but wasn’t implemented the same way in glibc/Debian. 2
  3. Finally, there is a maximum number of A records in a DNS response due to UDP packet size. If a UDP request traverses public networks, it’s safe to assume the response will be truncated at 512 bytes—the classic DNS-over-UDP payload limit, chosen so that a datagram fits the minimum MTU every hop must support. This means we can have at most around 13 A records in a response. It can be mitigated by using IPVS or a similar pattern. Alternatively, it’s possible to return a subset of 13 LB IP addresses for every request and rotate them server-side from a bigger set in round-robin fashion. 3 4

References

  1. a property of a mathematical operation whose result is insensitive to the order of its arguments 

  2. Imminent Death of the Net Predicted: drplokta — LiveJournal 

  3. multi-metric packets 

  4. RFC 1191 - Path MTU discovery 
