Engineering Blog

Published on 29 January 2025

Improved Measurement of Object Storage Usage

How do we know how much we should charge for your Object Storage usage? A journey into rgw-metrics.

Cloudscale offers S3-compatible Object Storage built on top of our Ceph storage cluster with three-fold replication. To provide the S3 service, we use Ceph's RADOS Gateway (radosgw). While radosgw includes built-in usage tracking, we found its metrics insufficient for the needs of a public cloud provider like us.

Historical Object Storage usage data in the Control Panel.

In the Objects tab of our Control Panel, customers can view their exact usage over time and, for example, see the number of requests on a specific date. This detailed data is not just for customer insights; it is also critical for accurate billing. Besides the number of requests, the metrics include the number of objects, the storage used, and the network traffic.

To bridge the gap in capabilities, we developed our own solution: rgw-metrics.

What is rgw-metrics?

rgw-metrics is a microservice that repeatedly collects the current usage data for every bucket from radosgw. This data is aggregated into the current hourly segment, and the segments are persisted. The Control Panel then queries this usage data through an API provided by rgw-metrics. This API is quite narrow and has remained stable over the years: it only allows fetching metrics for one or more object users.
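The aggregation step can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the actual rgw-metrics code: the `Sample` shape, the field names, and the rule that request counts are summed per hour while object counts and stored bytes keep the latest value seen in that hour.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Sample:
    """One poll result for one bucket (hypothetical shape, not the radosgw format)."""
    bucket: str
    taken_at: datetime
    requests: int    # requests seen since the previous poll (assumed to be a delta)
    objects: int     # object count at poll time
    used_bytes: int  # stored bytes at poll time


def aggregate_hourly(samples):
    """Fold samples into one record per (bucket, hour)."""
    segments = defaultdict(lambda: {"requests": 0, "objects": 0, "used_bytes": 0})
    for s in sorted(samples, key=lambda s: s.taken_at):
        # Truncate the timestamp to the start of its hour to find the segment.
        hour = s.taken_at.replace(minute=0, second=0, microsecond=0)
        seg = segments[(s.bucket, hour)]
        seg["requests"] += s.requests   # counters accumulate over the hour
        seg["objects"] = s.objects      # gauges keep the latest value
        seg["used_bytes"] = s.used_bytes
    return dict(segments)


samples = [
    Sample("logs", datetime(2025, 1, 29, 10, 5, tzinfo=timezone.utc), 12, 100, 4096),
    Sample("logs", datetime(2025, 1, 29, 10, 35, tzinfo=timezone.utc), 8, 103, 5120),
    Sample("logs", datetime(2025, 1, 29, 11, 5, tzinfo=timezone.utc), 3, 101, 4608),
]
hourly = aggregate_hourly(samples)
```

In this sketch, the two samples taken between 10:00 and 11:00 end up in one segment, and the 11:05 sample opens a new one. A real implementation would also have to decide how to treat counter resets and missed polls, which this example ignores.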

┌─────────────────┐       ┌───────────────┐       ┌─────────────────┐
│  Control Panel  ├──────►│  rgw-metrics  ├──────►│  RADOS Gateway  │
└─────────────────┘       └───────────────┘       └─────────────────┘

rgw-metrics is designed as a standalone microservice that runs on both of our sites, so it operates independently of the Control Panel. This independence ensures that metrics are collected consistently, even during extended maintenance periods.

A journey of evolution

The first version of rgw-metrics was written in Flask back in 2017, when we first introduced our S3 storage. While functional, the application had received little maintenance since its launch. Over time, this led to challenges: the outdated dependencies, the manual deployment steps, and the fact that the Control Panel is built with a different framework, Django, made engineers cautious about touching the application.

To address these issues, we decided on a black-box rewrite of rgw-metrics, transitioning it from Flask to Django.

The black-box rewrite approach

To ensure a seamless transition, we prioritized preserving the existing API's behavior. This allowed us to build a collection of tests that validate the new service against the existing one. For instance, we compared the historical usage data from the past year for our public acceptance tests, together with that of countless other internal projects using the Object Storage. During development, we ran a script that compared the output of the new Django-based service with that of the original Flask-based implementation. This ensured that the new service matched the old one under various scenarios.

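# Essentially, the script automated these steps:
mkdir -p export_flask
# Fetch the same time range from the old (Flask) and the new (Django) service
curl -fsS -H "$AUTH_HEADER" "https://old-api.cloudscale.ch/v1/metrics/buckets?start=2023-12-31&end=2024-01-01" > "export_flask/metrics.json"
curl -fsS -H "$AUTH_HEADER" "https://api.cloudscale.ch/v1/metrics/buckets?start=2023-12-31&end=2024-01-01" > "metrics.json"
# Any output from diff points to a behavioral difference between the two services
diff export_flask/metrics.json metrics.json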
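A textual diff also reports differences that do not matter for API behavior, such as key order or whitespace. As a possible refinement, the following sketch compares two parsed JSON documents structurally and lists the paths at which they differ. The function name `find_differences` and the example payloads are hypothetical; the real response format of rgw-metrics is not shown in this post.

```python
import json


def find_differences(old, new, path="$"):
    """Return human-readable paths where two parsed JSON values differ."""
    if isinstance(old, dict) and isinstance(new, dict):
        diffs = []
        for key in sorted(old.keys() | new.keys()):
            if key not in old:
                diffs.append(f"{path}.{key}: only in new")
            elif key not in new:
                diffs.append(f"{path}.{key}: only in old")
            else:
                diffs.extend(find_differences(old[key], new[key], f"{path}.{key}"))
        return diffs
    if isinstance(old, list) and isinstance(new, list):
        if len(old) != len(new):
            return [f"{path}: length {len(old)} != {len(new)}"]
        diffs = []
        for i, (a, b) in enumerate(zip(old, new)):
            diffs.extend(find_differences(a, b, f"{path}[{i}]"))
        return diffs
    # Scalars (or mismatched types) are compared directly.
    return [] if old == new else [f"{path}: {old!r} != {new!r}"]


# Hypothetical payloads: same data, different key order, one changed value.
old_export = json.loads('{"buckets": [{"name": "logs", "requests": 20, "objects": 103}]}')
new_export = json.loads('{"buckets": [{"objects": 103, "name": "logs", "requests": 21}]}')
differences = find_differences(old_export, new_export)
for line in differences:
    print(line)
```

Here the reordered keys are ignored and only the changed `requests` value is reported. List order is still treated as significant, which may or may not be desirable depending on what the API guarantees.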

Thanks to this test-driven approach, we actually found multiple bugs, including one in our data migration scripts, where an existing column was copied to the wrong target column.

What is up next for rgw-metrics?

With the rewrite complete, rgw-metrics now benefits from up-to-date dependencies, a container-based deployment like that of our main application, and a similar code structure, all of which will help us develop additional features.

With the foundation strengthened, we are ready to tackle upcoming improvements such as the efficient detection of large buckets. Each bucket has a limit of 10 million objects; beyond this threshold, performance may degrade. Currently, we proactively contact users who approach this limit. However, gathering the necessary data through the current endpoints is suboptimal, as it requires iterating over every bucket of each object user. Because the number of object users grows every day, we need to extend the API to allow more efficient queries for large buckets, reducing overhead and improving responsiveness.
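The check itself is simple once per-bucket object counts are available in one place. The sketch below shows one way to express it; the 80 percent warning threshold, the function name `buckets_near_limit`, and the input mapping are assumptions for illustration, and only the limit of 10 million objects comes from the text above.

```python
OBJECT_LIMIT = 10_000_000  # per-bucket object limit mentioned in the text
WARN_PERCENT = 80          # assumed warning threshold, not from the article


def buckets_near_limit(object_counts, limit=OBJECT_LIMIT, percent=WARN_PERCENT):
    """Return (bucket, count) pairs at or above percent of limit, largest first."""
    threshold = limit * percent // 100  # integer arithmetic avoids float rounding
    near = [(name, count) for name, count in object_counts.items() if count >= threshold]
    return sorted(near, key=lambda item: item[1], reverse=True)


# Hypothetical object counts per bucket.
counts = {"logs": 9_400_000, "backups": 1_200, "media": 8_000_000, "tmp": 7_999_999}
flagged = buckets_near_limit(counts)
```

With these example counts, `logs` and `media` are flagged, while `tmp` stays just below the threshold. The expensive part in practice is obtaining `object_counts` without walking every object user, which is what a dedicated API query would address.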

Stay tuned as we continue to enhance our metrics system and provide an even better Object Storage experience for our users.

If you would like to share comments or corrections, you can reach our engineers at engineering-blog@cloudscale.ch.