Real cache retention times

Hi,
We are hosting around 50GB of technical documentation covering many different versions of the product, which amounts to a lot of HTML files and, to some extent, a lot of duplication.
The VCL is configured with a cache TTL of 1 month.
Ideally, when you visit a page from the latest version, it will be a HIT.
The older versions would be more likely to be a MISS.
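
For reference, the TTL is set with something along these lines (a minimal sketch rather than our actual VCL; the "/docs/" path prefix is made up for illustration):

```
sub vcl_fetch {
#FASTLY fetch
  # Keep documentation pages fresh in the Fastly cache for 30 days.
  # ("/docs/" is a hypothetical prefix; our real paths differ.)
  if (req.url ~ "^/docs/") {
    set beresp.ttl = 2592000s;  # 30 days
  }
  return(deliver);
}
```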

In practice, what seems to be happening is that when you view a random page, the first time it's a MISS.
Then for some time afterwards it is served from the Fastly cache.
Consider that search engine bots are constantly hitting pages and pulling them into the cache. Almost certainly, within the last month, a bot has visited any given page and caused it to be cached.
But when I visit the page a few days later, it's a MISS.

QUESTION:

Would there be a way to force the caching of a few versions? Say 1GB, 2GB, or 3GB of technical documentation that will not be purged, no matter what?

Or does the Fastly algorithm just say, “This customer is using a ton of storage, so as soon as any webpage hasn’t been visited in an hour, we are going to purge that file”? If it frequently purges most of the files, then a visitor going to a random page (embedded among many others) is often going to get a MISS, and browsing the pages might be slow for them. Is this topic already discussed in the docs someplace? Thanks.

Caching is a complex subject, and the algorithms involve dozens of variables. While Fastly does offer the possibility of ‘cache reservation’ (holding content in the cache despite it not being accessed frequently), that’s something which is really only relevant for large customers with substantial content sets (in other words, it’s expensive).

Content which is only accessed infrequently (less than once per day, for example) is not likely to stay in our caches between those accesses, as the caches are shared across thousands (or millions) of sites, many of which have extremely high traffic levels. Content expires from the cache based on the caching control headers provided by the origin (and any changes to those which may happen during handling in a Delivery or Compute service), but it may also simply get pushed out of the cache if more-active content from another site needs the storage space.
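
To make the distinction concrete: the freshness TTL is whatever those headers (or your VCL) say, while being pushed out for space is a separate process. As a sketch of the kind of adjustment a Delivery service can make during handling (illustrative values, assuming custom VCL), you might honor the origin's headers and only supply a default when it sends none:

```
sub vcl_fetch {
#FASTLY fetch
  # Honor the origin's freshness headers; only fall back to a default TTL
  # when the origin sent no caching information at all.
  if (!beresp.http.Surrogate-Control && !beresp.http.Cache-Control && !beresp.http.Expires) {
    set beresp.ttl = 3600s;  # 1 hour fallback, purely illustrative
  }
  return(deliver);
}
```

Even with a long TTL, an object can still be pushed out early for space reasons, which is the behaviour you're seeing.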

Thanks for the answer. It’s an interesting problem. We are also hosting some large, frequently accessed downloadable files, which leverage the Fastly cache very well. But I guess a documentation set where each page is visited on average once per day doesn’t benefit as much as I had hoped.


Are you making use of shielding, in a POP that is near (network-wise) to your origin? I’m curious what ‘MISS’ times you are seeing, since you report that access to content that gets a MISS is ‘slow’, but ‘slow’ is very subjective :)

How slow is slow… I have been thinking about how to measure that, and have an idea to “wget --mirror” a section of the website, comparing HIT, MISS, etc. Then I could post the results.

Both MISS and HIT are in the 0-2 second range, but it’s still a noticeable difference. One issue is that the origin web server retrieves content from AWS S3; a CDN cache hit wouldn’t need to do that. But then, one of the purposes of a front-end cache is exactly that it skips such processing steps and can just deliver content instantly.
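
To confirm which cache state each response had during such a crawl, a small bit of VCL can expose Fastly's internal state as a response header (Fastly's default X-Cache header usually carries HIT/MISS already; this is just more detailed). A sketch, assuming custom VCL, with a made-up header name:

```
sub vcl_deliver {
#FASTLY deliver
  # Expose the cache state (MISS, HIT, HIT-STALE, PASS, ...) so it can be
  # logged alongside the response time during the crawl.
  set resp.http.X-Cache-Debug = fastly_info.state;
  return(deliver);
}
```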

Results:

  • The original baseline was a VM at Rackspace.
  • Fastly HIT - cached content is about 4 times faster than the baseline.
  • Fastly MISS - uncached content is served from the new backend, including a round trip to S3, and can be up to 4 times slower than the original baseline.

I have just enabled Shielding. We’ll see how that affects statistics in the next few days.

At least for the current configuration, shielding is probably an overall benefit.

  • Origin Offload increased and Miss Time decreased. Great.
  • Side effects: the hit ratio decreased slightly, and there are more cache misses.

However, since the CDN is not serving anywhere near 100% of the content, we must still focus on the latency and performance of the origins.

Thanks for sharing! You’re absolutely right: origin performance is quite important for sites that don’t have heavy traffic. That seems counter-intuitive, but the CDN can do a better job of origin offload when the traffic level is high.