Making queryable logs work at large scale

I’m a sysadmin for the OpenStreetMap Standard Layer, which has the CDN sponsored by Fastly. I wrote a post for the OpenStreetMap community about how we do our log processing, and thought it might be of interest to people here.

We’re sitting about 32k requests per second, which is a large volume of logs, and we need to be able to query them systematically while not blowing the budget on S3 costs, or using proprietary expensive logging platforms. We also can’t use a sampling-based approach.

The Standard Tile Layer has a lot of traffic. On August 1st, a typical day, it had 2.8 billion requests served by Fastly, about 32 thousand a second. The challenges of scaling to this size are documented elsewhere, and we handle the traffic reliably, but something we don’t often talk about is the logging. In some cases, you could log a random sample of requests but that comes with downsides like obscuring low frequency events, and preventing some kinds of log analysis. Critically, we publish data that depends on logging all requests.

We query our logs with Athena, a hosted version of Presto, a SQL engine that, among features, can query files on an object store like S3. Automated queries are run with tilelog, which publishes files daily to generate published files on usage of the standard tile layer.

As you might imagine, 2.8 billion requests is a lot of log data. Fastly offers a number of logging options, and we publish compressed CSV logs to Amazon S3. These logs are large, and suffer a few problems for long-term use because they:

  1. contain personal information like request details and IPs, that, although essential for running the service, cannot be retained forever;
  2. contain invalid requests, making analysis more difficult;
  3. are large, being 136 GB/day; and
  4. become slow to query, being compressed gzip files with the only indexing being the date and hour of the request, which is part of the file path.

To solve these problems we reformat, filter, and aggregate logs which lets us delete old logs. We’ve done the first two for some time, and are now doing the third.

The first step is to filter out non-tile requests, rejected requests, and convert to a better data format for querying. Tilelog does this by running a SQL query to update the successful request logs based on the raw request logs. This query filters to just successful tile requests, converts paths to x/y/z coordinates, and converts text to numbers where applicable. This cuts the file size in half to 71 GB/day, but critically improves query performance because Parquet is both faster to parse than gzipped csv and is a columnar store. Columnar stores are good for queries which only fetch data from some columns. Since all the queries we normally run fetch data from all rows but only some columns, this is a huge performance boost because less data needs to be read.

For a long time this was all the filtering that was done, but because each successful request results in a row, they’re still large log files. Additionally, they retain personal information which cannot be retained forever.

To reduce data volumes and personal information further, we need to know what information is useful long-term. Experience has shown that queries run on historic data are of two types

  1. tile focused, where the query is trying to answer a question about what tiles are accessed; or
  2. request focused, where the query is looking at what software has been requesting tiles, but doesn’t need to know which tiles were accessed.

The first requires a more detailed version of the tile request logs which have how many requests have been made for each tile. This is done by a query that aggregates based on tile, time up to the hour, approximate requestor region, and Fastly datacenter. This brings the size down to 17 GB/day. This is enough information to create tile view animations which show how tiles are accessed around the globe.

The second is more focused on abuse prevention and historical analysis of what software uses the standard layer. For this, a query that aggregates based on time up to the hour, requestor information, and HTTP headers is used. This brings the size down to 8 GB/day.


Thanks for sharing @pnorman :+1: