
Evaluating PG numbers

Getting the PG numbers right is crucial for a well-balanced and maintainable cluster. Let's see how to calculate them manually:

PGs per OSD

Ceph tracks how many PGs each OSD currently holds. We are aiming for ~100 PGs per OSD, and it's better to overshoot than to undershoot here. Just don't go overboard, or you might end up with PGs stuck in peering.
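To check the current distribution, you can look at the PGS column of ceph osd df, which shows the number of PG shards mapped to each OSD:

ceph osd df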

Manual PG calculations

Calculating PG numbers is a bit easier for replicated pools than for erasure-coded ones. Either way, we want to make sure all PGs end up about the same size.

Let's say we have 3 pools and 1000 OSDs:

#pool       stored    raw total
replica_3   250 TB    (* 3 = 750 TB)
ec_4_2      500 TB    (* (1 / 4 * 2 + 1) = 750 TB)
ec_8_3      1 PB      (* (1 / 8 * 3 + 1) = 1375 TB)
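For a real cluster, you can take these numbers from ceph df detail, which lists the stored data and raw usage per pool (the column names vary slightly between Ceph releases):

ceph df detail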

Let's calculate the PG size each pool would have if it had a single PG:

#pool       total / chunks      PG Size
replica_3   750 TB  / 3       = 250 TB
ec_4_2      750 TB  / (4 + 2) = 125 TB
ec_8_3      1375 TB / (8 + 3) = 125 TB

Since we are aiming for ~100 PGs per OSD in total, we split that budget in proportion to the PG sizes above: a single PG of replica_3 would be twice as large as one of either EC pool (250 TB vs. 125 TB), so replica_3 gets twice the share:

  1. 50 PGs per OSD for replica_3
  2. 25 PGs per OSD for ec_4_2
  3. 25 PGs per OSD for ec_8_3

Our final PG numbers are:

#pool       pgs * osds / chunks   pg_num_ideal          rounded to power of two
replica_3   50 * 1000 / 3       = 16666.6666666667    = 16384 PGs
ec_4_2      25 * 1000 / (4 + 2) = 4166.6666666667     = 4096 PGs
ec_8_3      25 * 1000 / (8 + 3) = 2272.7272727273     = 2048 PGs
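To apply the result, set pg_num on each pool. For example, with the pool names from this calculation (on Ceph releases before Nautilus you also need to raise pgp_num to the same value; newer releases adjust it automatically):

ceph osd pool set replica_3 pg_num 16384
ceph osd pool set ec_4_2 pg_num 4096
ceph osd pool set ec_8_3 pg_num 2048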

Autoscaler

Ceph pools rarely stay the same size forever, so keeping pg_num perfect by hand is unrealistic. Ceph upstream has built the autoscaler, which is designed to adjust pg_num automatically.

On its own, the autoscaler's recommendations are not ideal:

It starts splitting or merging PGs whenever a pool grows or shrinks enough, which causes a lot of data movement. That's why the croit default keeps the autoscaler in warn mode: instead of applying its recommendations, it shows them as Ceph health warnings.
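If you manage pools yourself, the same behaviour can be set per pool; the mode is one of on, off, or warn. For example:

ceph osd pool set $POOL_NAME pg_autoscale_mode warn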

To improve its recommendations, you have a few options:

  1. You can specify the pool's target size:
ceph osd pool set $POOL_NAME target_size_bytes 100T
  2. You can specify the pool's target ratio (takes precedence over the target size):
ceph osd pool set $POOL_NAME target_size_ratio 1.0

Ceph will warn you when the pool's size grows beyond its target_size, so you'll know when to update the value.
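You should also be able to read the configured value back, for example:

ceph osd pool get $POOL_NAME target_size_bytes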

You can check recommendations with:

ceph osd pool autoscale-status

Why can't I manage the autoscaler in croit?

We are working on it :)