Stale content on error with shielding


#1

Hi,

We’re using the VCL as outlined here to serve stale content when our backend returns error responses. I appear to be running into a situation where the edge nodes treat a stale response coming back from the shield node as fresh, cacheable content and cache it. Once our origin starts returning good responses again, the shield knows to use those instead of the stale object. But because the edge cached the stale content that originally came from the shield, that edge cluster continues serving up the stale content.

Example, with edge nodes Y and Z, and shield node X:

request -> Y -> X -> origin -> content (v1)
soft purge content
request -> Y -> X -> origin -> error -> content (v1)
request -> Y -> content (v1)
request -> Z -> X -> origin -> content (v2)
request -> Z -> content (v2)
request -> Y -> content (v1)

Has anyone run into this issue? If so, how did you solve it? I’m thinking that I’ll need to set some header on the response object that indicates the content is stale and shouldn’t be cached by the edge nodes. Does that sound like the right approach?
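Roughly what I have in mind is something like the following. This is completely untested, X-Is-Stale is just a placeholder header name, and I’m not sure fastly_info.state is the right thing to key off of:

# Tag stale responses on whichever node serves them (the shield, in this scenario).
sub vcl_deliver {
   # fastly_info.state reports values like "HIT-STALE" when stale content is served
   if (fastly_info.state ~ "STALE") {
      set resp.http.X-Is-Stale = "1";
   }
#FASTLY deliver
}

# On the edge, refuse to cache anything the shield has marked as stale.
# (The origin never sets X-Is-Stale, so this is a no-op for the shield's own fetches.)
sub vcl_fetch {
   if (beresp.http.X-Is-Stale) {
      set beresp.cacheable = false;
   }
#FASTLY fetch
}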

Thanks!

– Ray


#2

Hi Ray-

I was actually investigating a similar issue over the weekend and have filed an issue about how having an origin shield configured kind of complicates this interaction. With soft-purge and SWR configured, you should expect the first response post-purge to be a stale response while it refreshes in the background, and you see that when origin shield is not involved.
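For context, by “SWR configured” I mean something along these lines in vcl_fetch (the specific TTL and stale windows are just placeholder values), with soft purges issued as a PURGE request carrying the Fastly-Soft-Purge: 1 header so the object is marked stale rather than evicted:

sub vcl_fetch {
   # Cache normally, but allow the stale copy to be served for a while after it
   # expires or is soft-purged, while a background refresh happens.
   set beresp.ttl = 1h;
   set beresp.stale_while_revalidate = 60s;
   set beresp.stale_if_error = 24h;
#FASTLY fetch
}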

I do eventually see the fresh object getting delivered, but it takes more requests than I would expect. E.g. I just tested this and it took ten or so requests before everything was consistent.

I think what’s happening is that within a POP there are primary and secondary storage locations where any given object gets written to disk, and busy objects also get cached in RAM on the rest of the nodes in the POP. The SWR response is probably being returned from each of the cache nodes where the object is resident in RAM, so you get additional stale-while-revalidate responses while those nodes refresh asynchronously.

I can update this when I get confirmation.


#3

Hi Peter,

Thanks for the response. That’s really interesting behavior you’re seeing. If you take a look at the TTLs of the objects being served up from the edge nodes, what do you see? Are they getting set as they normally would for content delivered from your shield?

In my case we’re setting large TTLs on our objects, and it appears that even when the shield serves up that initial stale object, the edge nodes set the full TTL on the response. So it looks to me like they’re fully caching the stale object at the edge. I’ll see if I get behavior similar to yours after a bunch of requests.
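In case it’s useful, this is roughly how I’ve been checking the TTLs (the X-Debug-* header names are just ones I made up for debugging):

sub vcl_fetch {
   # Surface the TTL and SWR window this node is about to store, so they can be
   # inspected from outside with curl -I.
   set beresp.http.X-Debug-TTL = beresp.ttl;
   set beresp.http.X-Debug-SWR = beresp.stale_while_revalidate;
#FASTLY fetch
}

Since vcl_fetch runs on whichever node is doing the fetch, at the edge these headers should reflect the values the edge applied when it pulled the object from the shield.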


#4

I do see inconsistency though. I’ve opened an issue with the Varnish team and will report back with their findings.

–Peter


#5

Hi rayd -

sorry for the delay. It turns out that when you have Stale-While-Revalidate and shielding configured and do a soft purge, the edge nodes don’t recognize the object they receive from the shield (an SWR response) as an SWR response.

The way to get around this is to detect whether the request is a shield request and disable SWR for those requests.

This can be done in custom VCL, above the #FASTLY recv macro:

sub vcl_recv {
   # Fastly-FF is present on requests forwarded from another Fastly cache,
   # i.e. requests arriving at the shield from an edge node, so this disables
   # stale-while-revalidate only for shield requests.
   if (req.http.Fastly-FF) {
      set req.max_stale_while_revalidate = 0s;
   }

#FASTLY recv

   # ... rest of the standard vcl_recv boilerplate (return(lookup), etc.)
}

Hope this helps.