SSE, caching, and reconnecting: a problem and its solution

A quick summary of an issue we ran into working with Server-Sent Events (SSE) and how we resolved it. I haven’t seen this documented anywhere, so hopefully someone will find it useful.

Our SSE service was configured very much along the lines of Andrew Betts’ blog post on the subject, as I’m sure many are.
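
For context, the relevant part of such a configuration in vcl_fetch looks roughly like the sketch below. The 30-second TTL and the Cache-Control handling are illustrative assumptions, not the exact values from that post; the essential point is that the SSE response is made cacheable at the edge, which is what enables Fastly’s request collapsing for concurrent clients.

    # Illustrative sketch only (assumed TTL and header handling):
    # making the SSE response edge-cacheable lets Fastly collapse
    # concurrent client requests onto a single backend connection.
    if (beresp.http.Content-Type ~ "text/event-stream") {
        set beresp.ttl = 30s;                        # assumed edge-cache window
        set beresp.http.Cache-Control = "no-store";  # keep downstream caches from storing the stream
    }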

There is a side effect that can occur for an SSE request/response through Fastly when the backend drops the connection unexpectedly, before the normal connection/cache timeout window. The response’s cache time is then longer than its connection time, which means a complete cached response is now present in the cache. That cached response will be returned for any request made for that SSE, without keeping the client connection open for further streaming or triggering Fastly to reconnect to the backend, and this continues until the cached response expires. This is actually documented here:

For requests to be collapsed together, the origin response must be cacheable and still ‘fresh’ at the time of the new request. However, if the server has ended the response, and the resource is still considered fresh, it will be in the Fastly cache and new requests will simply receive a full copy of the cached data immediately instead of receiving a stream.

With a stable backend, this should rarely happen and doesn’t pose a serious problem: clients will simply keep reconnecting until the cache expires and a new connection to the backend is opened. But it can become a problem when your SSE backend happens to be a clustered service of some “volatility” where individual server nodes are frequently removed and/or restarted, when you are rate-limiting connections to your SSE service, or when you have a large number of clients, where a flood of reconnects might itself pose a problem.

But there is a solution to this issue, and it’s one that can be easily incorporated into almost any Fastly “SSE-aware” configuration with little to no unwanted side effects. When a client makes an SSE request, if the response would be a fully cached response that no longer has an open connection to the backend, that request will pass through vcl_hit and the cached object will have a Content-Length header (an in-progress stream has no Content-Length, since its final length isn’t known yet).

So, using that little bit of knowledge, you can manage this situation by adding the following to vcl_hit, before its call to return(deliver):

    # If the response is an SSE and is "complete" (has a content-length header), it's effectively stale...
    if (obj.http.content-type == "text/event-stream" &&
        obj.http.content-length &&
        obj.status < 300 &&
        obj.age > 2s && obj.age < obj.ttl &&
        req.backend.healthy &&
        req.restarts < 2) {
        set obj.ttl = 0s;  # Setting obj.ttl to 0 in vcl_hit triggers a purge of the cached object
        restart;  # Try again to get a fresh copy of this.
    }

You’ll see that we check several things to make sure we are only doing this when we should:

  • This is an SSE response
  • The response has a Content-Length header
  • The response status is not a redirect or an error
  • The response has been cached for a short while (more than 2 seconds), but is still within its TTL
  • The backend for the SSE is healthy
  • We haven’t already hit our restart limit

Once we’ve verified all of that, we set obj.ttl to 0, which triggers a purge of the cached object, dropping it before the next iteration of this request. We then restart, forcing this request to be retried, and the retry will now open a new connection to the backend (or collapse onto one another client has already opened).

Note that if you are averse to adding a restart and are OK with allowing a single extra client-side connection attempt, you can remove the restart and let the request fall through to return(deliver). That will deliver this now-stale object one last time, but force the next request attempt to go to the backend; a sketch of this variant follows below. Both options will ultimately work.
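
For example, the no-restart variant might look like the following sketch. Note that the req.backend.healthy and req.restarts guards are no longer needed, since nothing is retried at the edge:

    # No-restart variant (sketch): purge the completed object but still
    # deliver it once; the client's automatic SSE reconnect then misses
    # the cache and opens a fresh backend connection.
    if (obj.http.content-type == "text/event-stream" && obj.http.content-length &&
        obj.status < 300 && obj.age > 2s && obj.age < obj.ttl) {
        set obj.ttl = 0s;  # drop the object from cache
        # No restart here; execution falls through to return(deliver) as usual.
    }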

One further important note: you should avoid calling return(pass) in this situation, as it will open an uncacheable/uncollapsible connection to the backend for this request, potentially leading to many more long-lived backend connections for the same resource.
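
In other words, avoid a branch like this (shown only to illustrate the anti-pattern):

    # DON'T do this: pass bypasses both the cache and request collapsing,
    # so every client that reaches this branch gets its own long-lived
    # backend connection for the same stream.
    if (obj.http.content-type == "text/event-stream" && obj.http.content-length) {
        return(pass);
    }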

I hope you find this helpful.


Thanks for the time and care that went into documenting this behavior @DrEnter :clap: