Fastly and S3 - dealing with 'read after update' eventual consistency


#1

Apparently, read-after-write for new keys in S3 has been consistent for some time now. However, read-after-update (overwriting an existing key) is still only eventually consistent.

I read an article from HashiCorp about how they use a surrogate key to invalidate all S3 data for a given website: https://www.hashicorp.com/blog/serving-static-sites-with-fastly--s3--and-middleman/

But that doesn’t seem to deal with the issue of updated content. This flow still seems possible:

  • New static site is generated (some new pages, but mostly updated pages)
  • Fastly cache is invalidated for the whole S3 static website
  • Page is requested via Fastly before the updated key is consistent on S3
  • Fastly forever serves the old S3 page
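To make the race concrete, here is a toy Python model. It is not real S3 or Fastly behaviour: the class names, the two-tick `lag`, and the "cache until purged" rule are all hypothetical simplifications. Overwrites become visible only after a delay, and the cache keeps whatever it fetched on a miss:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class EventuallyConsistentStore:
    """Toy origin: new keys are readable immediately, but an overwrite of an
    existing key only becomes visible `lag` ticks later (hypothetical model)."""
    lag: int
    now: int = 0
    visible: Dict[str, str] = field(default_factory=dict)
    pending: Dict[str, Tuple[str, int]] = field(default_factory=dict)

    def put(self, key: str, value: str) -> None:
        if key in self.visible:
            # Overwrite: readers keep seeing the old value until the lag passes.
            self.pending[key] = (value, self.now + self.lag)
        else:
            # New key: read-after-write is consistent.
            self.visible[key] = value

    def tick(self) -> None:
        """Advance time by one step and publish any overwrites that are due."""
        self.now += 1
        for key, (value, visible_at) in list(self.pending.items()):
            if self.now >= visible_at:
                self.visible[key] = value
                del self.pending[key]

    def get(self, key: str) -> Optional[str]:
        return self.visible.get(key)


@dataclass
class LongTtlCache:
    """Toy CDN: whatever it fetches on a miss is kept until the next purge."""
    origin: EventuallyConsistentStore
    entries: Dict[str, Optional[str]] = field(default_factory=dict)

    def purge_all(self) -> None:
        self.entries.clear()

    def get(self, key: str) -> Optional[str]:
        if key not in self.entries:
            self.entries[key] = self.origin.get(key)
        return self.entries[key]


store = EventuallyConsistentStore(lag=2)
cache = LongTtlCache(store)
store.put("index.html", "v1")
cache.get("index.html")        # miss: caches "v1"
store.put("index.html", "v2")  # overwrite, not yet visible
cache.purge_all()              # purge straight after the upload
cache.get("index.html")        # request lands in the lag window: "v1" is cached again
store.tick()
store.tick()                   # the origin is now consistent...
print(store.get("index.html"), cache.get("index.html"))  # v2 v1
```

The purge happens, but the first request after it lands inside the lag window, so the stale page is what ends up cached.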

I know Google’s equivalent, GCS, doesn’t have this issue, but I wonder how to mitigate this with S3.


#2

Here is a workaround: set your desired TTL as the stale-while-revalidate period, and use a shorter TTL for the cache itself. Fastly will then always serve from cache, but will revalidate against S3 once the short TTL has expired, so newer objects are eventually picked up. When an object hasn’t changed, S3 should answer the revalidation with a 304 Not Modified, so the extra bandwidth in this setup should be minimal, even for large objects.
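A minimal sketch of the header value this workaround implies. It assumes the value is stored as the object’s `Cache-Control` metadata on S3 (which S3 returns as a response header) and that Fastly honours `stale-while-revalidate` in it; the function name and the 60-second / 30-day numbers are purely illustrative. Note that browsers will see this header too unless it is stripped or overridden at the edge.

```python
def cache_control_with_swr(ttl_seconds: int, swr_seconds: int) -> str:
    """Header value: fresh for `ttl_seconds`, then served stale for up to
    `swr_seconds` while the CDN revalidates against the origin."""
    if ttl_seconds < 0 or swr_seconds < 0:
        raise ValueError("TTLs must be non-negative")
    return f"max-age={ttl_seconds}, stale-while-revalidate={swr_seconds}"


# Example: revalidate after 60 seconds, keep serving stale copies for 30 days.
print(cache_control_with_swr(60, 30 * 24 * 3600))
# max-age=60, stale-while-revalidate=2592000
```

The trade-off is that an update becomes visible only after roughly `max-age` has elapsed and a background revalidation has completed, rather than immediately after an upload.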


#3

Thanks @iopov

I’m not sure that covers my requirement, unfortunately. I have a bunch of HTML pages in S3. The vast majority of pages won’t change for months, but individual pages might be updated, and users would expect to see those updates within a few seconds. Hence the massive range.

The flow I wanted was:

  • Page uploaded to S3 for an existing key
  • Single page purge submitted to Fastly
  • User requests page (set for no caching on client side)
  • Fastly fetches the page and caches it for a very long period (months)
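Step 2 of that flow can be sketched with the standard library alone, assuming Fastly’s single-URL purge: an HTTP `PURGE` request addressed to the cached URL, with a `Fastly-Key` header if authenticated purging is enabled on the service. The URL and key below are placeholders. This only builds the request; it does nothing about the race itself, since if S3 still returns the old object in step 4, the old object is what gets cached.

```python
import urllib.request
from typing import Optional


def build_url_purge(url: str, api_key: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a single-URL purge request for `url`."""
    headers = {}
    if api_key is not None:
        # Only required when the service has authenticated purging enabled.
        headers["Fastly-Key"] = api_key
    return urllib.request.Request(url, method="PURGE", headers=headers)


req = build_url_purge("https://www.example.com/about.html")  # placeholder URL
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```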

By all accounts, ‘eventual’ read-after-update consistency in S3 can be reached within 100ms, so this is only an edge case: it happens only if Fastly gets a request just after the purge, but just before the S3 object is consistent.

I believe your workaround is intended for relatively short TTLs?


#4

I wonder whether Edge Dictionaries might work.

Instead of submitting a purge, a key and hash pair is submitted to the Edge Dictionary, and Fastly checks for that hash in the S3 response headers. If the hash matches, it caches the response and removes the entry from the dictionary.
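The decision logic of this idea can be modelled in a few lines. This is a Python sketch of the logic only, not VCL, and the function name is hypothetical. It assumes the submitted hash is whatever S3 reports as the object’s ETag (for single-part uploads without SSE-KMS, that is the MD5 of the content), and that a response not matching the expected hash is served but not cached, so the next request goes back to S3. As far as I know, VCL can only read Edge Dictionaries, so the "remove from the dictionary" step would have to happen through the Fastly API.

```python
from typing import Dict


def decide_caching(path: str, origin_etag: str, pending: Dict[str, str]) -> bool:
    """Return True if the origin response for `path` may be cached.

    `pending` models the Edge Dictionary: path -> expected content hash for
    pages that were just updated but not yet seen from S3.
    """
    expected = pending.get(path)
    if expected is None:
        return True  # no outstanding update: cache normally
    if origin_etag.strip('"') == expected:
        del pending[path]  # S3 returned the new version: cache it, clear the entry
        return True
    return False  # S3 returned the old version: serve it, but don't cache it
```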


#5

Why not do a delayed flush? I.e. purge immediately, then, using Resque or any other scheduler, purge again after a second and again after a minute. Is there any guarantee on how long an S3 update takes to propagate?
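A minimal sketch of the delayed flush using Python’s `sched` module in place of Resque. `send_purge` is a hypothetical stand-in for whatever actually issues the purge, and the 0 / 1 / 60 second delays are the ones suggested above. Nothing guarantees that a minute is long enough; this only narrows the window in which a stale object can stay cached.

```python
import sched
from typing import Callable, Iterable


def schedule_purges(
    scheduler: sched.scheduler,
    purge: Callable[[str], None],
    url: str,
    delays: Iterable[float] = (0.0, 1.0, 60.0),
) -> None:
    """Queue one purge of `url` per delay (seconds from now)."""
    for delay in delays:
        scheduler.enter(delay, 1, purge, argument=(url,))


def send_purge(url: str) -> None:
    # Hypothetical stand-in: replace with a real purge call.
    print("purging", url)


s = sched.scheduler()  # real clock: s.run() would block for about a minute
schedule_purges(s, send_purge, "https://www.example.com/about.html")  # placeholder URL
# s.run()
```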


#6

The author of the post I linked to claimed that reading from a bucket in the same region after an update was consistent. However, Amazon’s own documentation claims otherwise.

Someone did a test of 100,000 writes, and in one region found just one inconsistent read-after-overwrite: https://github.com/andrewgaul/are-we-consistent-yet

This idea of waiting some indeterminate time comes up all the time in software, and I never like it. In this instance, it would mean blindly doing something for 99,999 writes that was actually needed only once.

It would be good if someone from Fastly chimed in, as their documentation mentions S3 backends, but not this consistency issue.


#7

I’ve been testing this particular use case recently, so let me pitch in :wink:

It’s probably best to read about the Amazon S3 Data Consistency Model:

Amazon S3 offers eventual consistency for overwrite PUTS and DELETES in all regions… (If) A process replaces an existing object and immediately attempts to read it. Until the change is fully propagated, Amazon S3 might return the prior data.

So, unfortunately, there is an undefined amount of time between the object being replaced on S3 and Fastly being able to pull the new version. If you replace an object and then purge on Fastly, occasionally Fastly will fetch the old version of the object after the purge request and cache it again, which defeats the purpose of the purge.

Lowering the TTL seems to be the best solution at this moment.

Changing object stores might be a large undertaking, but GCS does not have this limitation. Google’s Cloud Storage consistency documentation says:

When you upload an object to Cloud Storage, and you receive a success response, the object is immediately available for download and metadata operations from any location where Google offers service. This is true whether you create a new object or overwrite an existing object. Because uploads are strongly consistent, you will never receive a 404 Not Found response or stale data for a read-after-write or read-after-metadata-update operation.


#8

Hey @LeonBrocard

That’s exactly what I did, hence the entire premise of the question :slight_smile:

However, it seems the Fastly documentation doesn’t acknowledge this, which might result in some head scratching.

GCS does indeed make read-after-update consistent. In fact, did you see my post before yours, listing consistency results for multiple providers?

I wouldn’t mind using GCS at all, but that generally means one has to run everything else there too; otherwise, traffic between cloud providers adds enormously to the bill.