
Write common code to handle Fastly logging #1332

@pnorman

Description

Currently the tile CDN writes logs to S3 that are processed by tilelog into a more usable form. This started with raster tiles and expanded to vector tiles, but doesn't cover other services behind Fastly, which include or will include Nominatim and the website.

With five services behind Fastly we should have common log processing code that will make the logs easily searchable and allow aggregation for both predefined and ad-hoc analysis.

The tile dataflow is Fastly logs to compressed CSV on S3 every 10 minutes, in paths partitioned by year, month, day, and hour. These logs are about 1.0 TB/week.
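As a rough sketch, an hourly partition prefix in that layout could be built like this (the Hive-style `key=value` path segments are an assumption for illustration, not the actual bucket layout):

```python
from datetime import datetime, timezone

def partition_prefix(ts: datetime) -> str:
    """Build a Hive-style partition prefix for one hour of logs."""
    return (
        f"year={ts.year:04d}/month={ts.month:02d}/"
        f"day={ts.day:02d}/hour={ts.hour:02d}"
    )

print(partition_prefix(datetime(2024, 1, 2, 3, tzinfo=timezone.utc)))
# year=2024/month=01/day=02/hour=03
```

Zero-padded components keep the prefixes lexicographically sortable, which makes hourly and daily ranges easy to enumerate.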

Every day tilelog runs queries in Athena that read successful requests from the CSVs and write them to Parquet files. These files are about 780 GB/week but are substantially faster to query due to their format. They are partitioned the same way as the CSVs. They also turn free-form text fields like the request path into tile coordinates.
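Turning a request path into tile coordinates might look like the following sketch (the URL pattern and extensions are assumptions; the real tile URLs may differ):

```python
import re

# Hypothetical tile request pattern: /z/x/y with a raster or vector extension.
TILE_RE = re.compile(r"^/(\d+)/(\d+)/(\d+)\.(?:png|pbf)$")

def parse_tile(path: str):
    """Return (z, x, y) for a tile request path, or None if it isn't one."""
    m = TILE_RE.match(path)
    if m is None:
        return None
    z, x, y = (int(g) for g in m.groups())
    # Reject coordinates outside the valid range for the zoom level.
    if x >= 2 ** z or y >= 2 ** z:
        return None
    return z, x, y

print(parse_tile("/12/2048/1362.png"))
# (12, 2048, 1362)
```

Validating against the zoom level's range keeps junk requests out of the converted data rather than propagating them into the aggregates.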

After the successes are written to Parquet, summarized files are produced. One dataset groups requests by tile zoom, client information like user-agent, and request country. These logs are about 70 GB/week and can be retained longer because PII like IP addresses has been removed. They are partitioned the same way as the CSVs.

A second dataset is produced with tile location but information like user-agent removed. These logs are about 80 GB/week and are retained and partitioned like the previous logs.

Once these internal datasets are produced they are used to produce the public log files at https://planet.openstreetmap.org/tile_logs/. This is the only part that needs to run daily.

This approach has worked well for tiles but suffers from some drawbacks:

  • aggregation currently fails at sustained request rates of 75k TPS or higher within an hour. This is solvable.
  • the workflow should work fine on Trino (formerly Presto) but is run on AWS's hosted closed-source version.
  • all the queries are only run once a day. Running hourly would be better for responding to attacks. The CSV logs can be queried directly, but are much slower.
  • failures are not handled gracefully. If the workflow ran hourly, a failure in an hourly run that did not also affect the daily run would produce incorrect numbers in the public logs.
  • table definitions have to be kept in agreement between the OpenTofu Fastly configuration, the Athena table definitions, and tilelog.
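The last drawback could be reduced by deriving every artifact from a single field list. A minimal sketch, assuming illustrative field names and types rather than the real schema:

```python
# Hypothetical single source of truth for log fields, from which both the
# Fastly log format and the Athena table DDL could be generated.
FIELDS = [
    ("timestamp", "timestamp"),
    ("client_ip", "string"),
    ("request", "string"),
    ("status", "int"),
]

def athena_ddl(table: str) -> str:
    """Render a CREATE EXTERNAL TABLE statement from the shared field list."""
    cols = ",\n  ".join(f"`{name}` {type_}" for name, type_ in FIELDS)
    return f"CREATE EXTERNAL TABLE {table} (\n  {cols}\n)"

print(athena_ddl("fastly_logs"))
```

A second renderer over the same `FIELDS` list could emit the Fastly logging format string for the OpenTofu configuration, so the three definitions can no longer drift apart.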

Proposal

  • Write new software that performs format conversions and aggregations for all services behind a Fastly CDN and is designed to run hourly
  • Continue using Athena for its proven query usability
  • Move Athena table creation into the new code
  • Investigate more structured log formats than CSV
  • Identify the parameters key to each service's requests, e.g. tile z/x/y for tiles
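The last point could take the shape of a per-service extractor registry; the service names and URL patterns below are assumptions for illustration:

```python
import re
from typing import Callable, Optional

# Hypothetical registry: each service behind Fastly gets an extractor that
# pulls its key parameters out of the raw request path.
EXTRACTORS: dict[str, Callable[[str], Optional[dict]]] = {
    "tile": lambda p: (
        {"z": int(m[1]), "x": int(m[2]), "y": int(m[3])}
        if (m := re.match(r"^/(\d+)/(\d+)/(\d+)\.png$", p))
        else None
    ),
    "nominatim": lambda p: (
        {"endpoint": m[1]}
        if (m := re.match(r"^/(search|reverse|lookup)\b", p))
        else None
    ),
}

print(EXTRACTORS["tile"]("/7/63/42.png"))
# {'z': 7, 'x': 63, 'y': 42}
print(EXTRACTORS["nominatim"]("/search?q=berlin"))
# {'endpoint': 'search'}
```

Shared aggregation code can then stay generic, with only the extractor differing per service.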
