Why isn't serve stale working as expected?


#1

Here are some things to consider when troubleshooting why a stale object wasn’t served from cache while your origins were weathering temporary issues:

  • Custom VCL: the most effective way to set up serving stale is to follow this updated document and add the proper code to each subroutine. This can only be set up via custom VCL upload, not through the UI.

  • Cache: Stale objects are only available for cacheable content.

  • Shielding: If you don’t have shielding enabled, each datacenter can only serve stale under the condition that a request for that cacheable object was made through that datacenter before. Enabling shielding will increase the probability that stale exists and it’s a good way to refill the cache a little faster after a Purge All.

  • Requests: As traffic to your site increases, you’re more likely to see those stale objects available (even if shielding is disabled) as it’s reasonable to assume requests for a hot asset will come into various datacenters and get cached at multiple locations.

  • LRU: As a follow-up, Fastly does have an LRU (least recently used) list, so objects are not guaranteed to stay in cache for the entirety of their TTL. Eviction depends on many factors, including the object’s request frequency, its TTL, the POP from which it’s being served, etc. For instance, objects with a TTL longer than 3700s get written to disk, whereas objects with shorter TTLs end up in transient, in-memory-only storage. Set your TTL to >3700s when possible.

  • Purges: Limiting purges (issuing a purge by URL or surrogate key instead of a Purge All) and utilizing our new soft purge feature can help ensure that your content remains in our caches to serve stale.
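
To make the TTL advice in the list above concrete, here is a minimal Python sketch. It assumes your origin sets cache lifetimes through a `Surrogate-Control` response header carrying a `stale-if-error` directive (RFC 5861 style) rather than through custom VCL. The helper names are hypothetical, and the 3700s threshold is simply the figure quoted above, not a documented constant.

```python
# Threshold quoted in this thread: objects with a longer TTL get written to disk.
DISK_TTL_THRESHOLD_S = 3700


def is_disk_eligible(ttl_s: int) -> bool:
    """Return True if the TTL exceeds the 3700s figure quoted in this thread."""
    return ttl_s > DISK_TTL_THRESHOLD_S


def build_surrogate_control(ttl_s: int, stale_if_error_s: int) -> str:
    """Build a Surrogate-Control header value (hypothetical helper)."""
    if ttl_s < 0 or stale_if_error_s < 0:
        raise ValueError("durations must be non-negative")
    return f"max-age={ttl_s}, stale-if-error={stale_if_error_s}"


if __name__ == "__main__":
    ttl = 2 * 60 * 60           # 2 hours, above the quoted threshold
    stale = 30 * 24 * 60 * 60   # 30 days of stale-if-error
    print("Surrogate-Control:", build_surrogate_control(ttl, stale))
    print("disk eligible:", is_disk_eligible(ttl))
```

Whether these directives are honored as-is depends on how your service is configured, so treat the header as an illustration of the values involved rather than a complete setup.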


#2

Hi,
It’s not clear whether setting the TTL to >3700 is an absolute MUST for stale to work, or whether a shorter TTL just lowers the probability of the object still being cached long after the TTL ends.

Can you explain exactly how it will work if I set the TTL to 10m and stale_if_error to 30d, and then the origin goes down for a few hours?
I’d guess it depends on when the object was last accessed (and thus refreshed). So 3 cases:

  1. Last user accessed it a minute ago - object still within TTL.
  2. Last user accessed it 11 minutes ago - object just passed TTL.
  3. Last user accessed it a week ago - object long past the TTL, but still within stale_if_error time.

Thanks.


#3

Hey Ilya,

Having a Time To Live (TTL) of 3700s or above isn’t necessary for stale to function, but a shorter TTL definitely increases the odds that serving stale will fail when you need it. Any object with a TTL of less than 3700s will be stored in temporary memory on our cache nodes, and has a higher likelihood of being evicted, which increases the chance of stale not working because there isn’t a stale object for us to serve.

Here’s how each of the cases you’ve outlined plays out while your origin is down. In the first case, the object is still cached and within its TTL, so that is returned to the client.

In the second case, the cached object is no longer fresh, so we ask your origin for a new object; since your origin is down, we’ll end up returning the stale object, as long as the stale object is still in our cache (which is why setting the TTL higher is so important).

In the third case, it’s highly unlikely that the object will still exist in our cache, as the TTL is less than 3700s, so the likelihood for that one is that we won’t have an object to serve, and if your origin is down, we’ll have to return a 503.
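
The three cases above can be sketched as a small decision function. This is a simplified model for illustration, not a description of how the cache is actually implemented: the `CachedObject` type and `respond` function are hypothetical, and eviction is passed in as an input because nothing here can predict when the LRU will drop an object.

```python
from dataclasses import dataclass


@dataclass
class CachedObject:
    age_s: int              # seconds since the object was fetched from origin
    ttl_s: int              # freshness lifetime
    stale_if_error_s: int   # how long past the TTL stale may be served on error
    evicted: bool = False   # True if the LRU has already dropped the object


def respond(obj: CachedObject, origin_up: bool) -> str:
    """Return what the client gets under this simplified model."""
    if obj.evicted:
        # Nothing left in cache: only the origin can answer.
        return "fetch from origin" if origin_up else "503"
    if obj.age_s <= obj.ttl_s:
        return "fresh hit"
    if origin_up:
        return "fetch from origin"
    if obj.age_s <= obj.ttl_s + obj.stale_if_error_s:
        return "stale"
    return "503"


if __name__ == "__main__":
    ttl, sie = 10 * 60, 30 * 24 * 60 * 60  # 10m TTL, 30d stale_if_error
    print(respond(CachedObject(60, ttl, sie), origin_up=False))
    print(respond(CachedObject(11 * 60, ttl, sie), origin_up=False))
    # Case 3 assumes the object was evicted during the week, as described above.
    print(respond(CachedObject(7 * 24 * 60 * 60, ttl, sie, evicted=True), origin_up=False))
```

The only difference between the second and third cases in this model is the `evicted` flag, which is the part a longer TTL is meant to make less likely.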

Let me know if you have any other questions,
Amanda


#4

Do you plan on changing this functionality in the future?
For example, writing an object to disk if it needs to be evicted from memory but is still within its grace period.

It seems to me that the case I mentioned in my example is a very common one: a page is cached with some reasonable refresh rate, 10m-1h (less than 3700s), but there is still a desire to serve stale in case of errors.


#5

Hey Ilya,

At this time, we have no plans to change the setup. An object with a low TTL is one that needs to be refreshed constantly (whether or not the object is changing), and isn’t suited for being stored on SSD.

If you have an object that isn’t being accessed very often, and that changes little enough that serving the stale object is preferable to a synthetic error page, then I would recommend increasing the standard TTL, and using our soft purge feature when the object does change (I specify soft purging so that you maintain your stale cache in case of error).
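
As a rough sketch of the soft purge suggestion, the Python below builds (but does not send) a purge-by-URL request. It assumes a soft purge is requested by adding a `Fastly-Soft-Purge: 1` header to a `PURGE` request, and that an API token, if your service requires one, goes in a `Fastly-Key` header. The function name and the example URL are hypothetical.

```python
import urllib.request
from typing import Optional


def build_soft_purge_request(url: str, api_token: Optional[str] = None) -> urllib.request.Request:
    """Build a PURGE request that marks the object stale instead of removing it."""
    headers = {"Fastly-Soft-Purge": "1"}
    if api_token:
        # Assumption: only needed when purging requires authentication.
        headers["Fastly-Key"] = api_token
    return urllib.request.Request(url, method="PURGE", headers=headers)


if __name__ == "__main__":
    req = build_soft_purge_request("https://www.example.com/some/page")
    print(req.get_method(), req.full_url)
    for name, value in req.header_items():
        print(f"{name}: {value}")
```

To actually issue the purge you would pass the request to `urllib.request.urlopen`; that step is left out here so the sketch makes no network calls.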

Best,
Amanda