
How to Recover Inactive PGs Using ceph-objectstore-tool on Ceph Clusters

Introduction

Ceph is a powerful, open-source storage solution designed to provide high performance, reliability, and scalability. As a distributed storage system, Ceph can manage vast amounts of data across numerous servers. However, managing Placement Groups (PGs) within Ceph can sometimes be challenging, especially when hardware or software failures have already made some of the PGs inactive. In this blog post, we'll explore how to export and import inactive PGs using the ceph-objectstore-tool and discuss the different types of replications for PGs.

What Are Placement Groups (PGs)?

Placement Groups (PGs) are a critical component of Ceph's architecture. They act as logical containers that help map objects to OSDs (Object Storage Daemons). This mapping ensures that data is evenly distributed across the cluster, providing fault tolerance and high availability.

Types of PG Replications

Ceph supports two primary ways of protecting PG data: replication and erasure coding.

Replicated PGs

In a replicated pool, data is simply copied across multiple OSDs. The replication factor (e.g., 3x replication) determines how many copies of each object exist. For example, with a replication factor of 3, each object is stored on three different OSDs.

  • Pros: Simple to configure and manage, provides high data availability and redundancy.
  • Cons: Consumes more storage space (e.g., 3x replication uses three times the storage of the original data).
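
As a quick illustration, a replicated pool can be created and its replication factor inspected or changed with commands like the following (the pool name and PG count are placeholders, not values from this cluster):

ceph osd pool create mypool 128 128 replicated   # create a replicated pool with 128 PGs
ceph osd pool set mypool size 3                  # keep 3 copies of each object
ceph osd pool get mypool size                    # verify the replication factor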

Erasure-Coded PGs

Erasure coding is a more storage-efficient way to provide data redundancy. Data is split into k data chunks and m coding chunks, creating a total of k+m chunks. For instance, in a 4+2 erasure-coded pool, data is divided into 4 data chunks and 2 coding chunks.

  • Pros: More storage-efficient than replication, reducing storage overhead.
  • Cons: More complex to manage and may have higher latency and CPU usage during reconstruction.
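
For comparison, a 4+2 erasure-coded pool could be created roughly like this (the profile name, pool name, and failure domain are placeholders):

ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host   # define the 4+2 profile
ceph osd pool create ecpool 128 128 erasure ec-4-2                           # create a pool using it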

Ceph Auto Recovery

Ceph's self-healing and auto-recovery capabilities are some of its most compelling features. When an OSD fails, Ceph automatically detects the failure and starts the recovery process to restore data redundancy. However, Ceph's ability to auto-recover depends on the type of replication used and the number of available replicas or data chunks:

  • Replicated Pools: Auto recovery is straightforward for replicated pools as long as at least one copy of the data remains available. For instance, with a replication factor of three, Ceph can auto-recover even if two of the three replicas are lost, as long as one replica is still accessible.
  • Erasure-Coded Pools: The situation is slightly more complex for erasure-coded pools: recovery depends on the (k, m) coding scheme, and you need at least k of the k+m chunks (data or coding) to reconstruct the original data. In a 4+2 pool, any 4 out of the 6 chunks are sufficient to restore the data.

In some cases, Ceph might be unable to recover PGs automatically. This situation can arise due to insufficient replicas or chunks, severe hardware failures, or prolonged network issues. When Ceph is unable to recover inactive PGs on its own, manual intervention becomes necessary.
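
Before intervening manually, it's worth confirming that the affected PGs really are stuck; a few commands that help with that (sample checks, output omitted) are:

ceph -s                       # overall cluster health and recovery progress
ceph pg dump_stuck inactive   # PGs that have been inactive for too long
ceph health detail            # per-PG detail, as used in the next section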

Ceph Manual Recovery

To restore the functionality of the PGs, we can manually export and import them using the ceph-objectstore-tool. This can help when the faulty storage devices are still accessible: the OSDs may fail to start or crash because of corruption, but the PGs we're trying to retrieve are not necessarily the cause of the crash and can often be extracted successfully.

This process involves identifying the inactive PGs, exporting them from their current location, and importing them to a healthy OSD:

Identify the affected PGs and OSDs

1. Check the health details to obtain the inactive PG ID:

root@vl-mgmt ~ $ ceph health detail
HEALTH_WARN 1 MDSs report slow metadata IOs; 3 osds down; Reduced data availability: 10 pgs inactive, 1 pg down; Degraded data redundancy: 35/172 objects degraded (20.349%), 13 pgs degraded
...
[WRN] PG_AVAILABILITY: Reduced data availability: 10 pgs inactive, 1 pg down
...
pg 16.0 is down, acting [NONE,NONE,NONE,3]
...

This PG is part of an EC 2+2 pool. The message pg 16.0 is down, acting [NONE,NONE,NONE,3] indicates that only 1 out of 4 chunks is available. Our goal is to restore one more chunk so that Ceph can recover on its own (it needs at least k=2 chunks to do so).
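
If you are unsure about the pool's k and m values, you can look them up from the pool's erasure code profile; the profile name below is a placeholder for whatever ceph osd pool ls detail reports for pool 16:

ceph osd pool ls detail                            # shows pool 16's name and its erasure code profile
ceph osd erasure-code-profile get <profile-name>   # shows k and m for that profile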

2. Try to find the rest of the past acting set (the down OSDs) by querying the PG:

root@vl-mgmt ~ $ ceph pg 16.0 query
{
...
    "recovery_state": [
        {
            "name": "Started/Primary/Peering/Down",
            "enter_time": "2024-06-04T20:03:24.278534+0000",
            "comment": "not enough up instances of this PG to go active"
        },
        {
            "name": "Started/Primary/Peering",
            "enter_time": "2024-06-04T20:03:24.278488+0000",
            "past_intervals": [
                {
                    "first": "1217",
                    "last": "1224",
                    "all_participants": [
                        {
                            "osd": 3,
                            "shard": 3
                        },
                        {
                            "osd": 4,
                            "shard": 2
                        },
                        {
                            "osd": 8,
                            "shard": 0
                        },
                        {
                            "osd": 14,
                            "shard": 1
                        }
                    ],
...
}

3. Find the host that contains any of the down OSDs:

root@vl-mgmt ~ $ ceph osd find 4
{
    "osd": 4,
...
    "host": "vl-srv1",
    "crush_location": {
        "host": "vl-srv1",
        "root": "default"
    }
}

4. List all the PGs in the down OSD with ceph-objectstore-tool:

root@vl-srv1 ~ $ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 --op list-pgs
...
16.0s2
...
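
Since list-pgs prints every PG shard stored on that OSD, you can narrow the output down to the affected pool, for example:

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 --op list-pgs | grep '^16\.'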

Export the PG

1. Connect to the host, stop the OSD service containing the PG if it's not already stopped, and mask it to prevent accidental start-up by someone else:

root@vl-srv1 ~ $ systemctl stop ceph-osd@4.service
root@vl-srv1 ~ $ systemctl mask ceph-osd@4.service
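
Optionally, you may also want to set the noout flag so that stopped OSDs are not marked out and don't trigger extra data movement while you work; remember to unset it once the recovery is finished:

ceph osd set noout     # before the maintenance
ceph osd unset noout   # after the recovery is complete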

2. Export the entire PG shard to a file:

root@vl-srv1 ~ $ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 --pgid 16.0s2 --op export --file pg.16.0s2.dat
Exporting 16.0s2 info 16.0s2( v 1170'26 (0'0,1170'26] local-lis/les=1221/1222 n=25 ec=1156/1156 lis/c=1221/1217 les/c/f=1222/1218/0 sis=1221)
Read 2#16:0b2e8e02:::100000003ea.00000001:head#
...
Read 2#16:ffb5c1f5:::100000003ea.00000013:head#
Export successful

Ensure the --file parameter's path has enough space for the temporary file.
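
A quick way to check the free space of the filesystem you are exporting to (here, the current directory) is:

df -h .   # free space where the export file will be written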

Import the PG

1. Copy the PG file to another server with an available OSD and enough capacity to receive the new PG.

You can place the PG on any OSD device class (SSD, HDD, etc.). If the target doesn't match the PG's pool CRUSH rule, Ceph will mark the PG as misplaced and move it to the correct device.
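
For example, the export file could be copied over with scp (the destination host and path here are illustrative):

scp pg.16.0s2.dat root@vl-srv2:/root/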

2. Stop the target OSD service and mask it to prevent accidental start-up:

root@vl-srv2 ~ $ systemctl stop ceph-osd@19.service
root@vl-srv2 ~ $ systemctl mask ceph-osd@19.service
root@vl-srv2 ~ $ systemctl status ceph-osd@19.service
ceph-osd@19.service - Ceph object storage daemon osd.19
     Loaded: masked (Reason: Unit ceph-osd@19.service is masked.)
    Drop-In: /etc/systemd/system/ceph-osd@.service.d
             └─override.conf
     Active: inactive (dead) since Wed 2024-05-08 17:11:44 UTC; 13s ago

3. Import the PG file into the target OSD:

root@vl-srv2 ~ $ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-19 --no-mon-config --pgid 16.0s2 --op import --file ./pg.16.0s2.dat
get_pg_num_history pg_num_history pg_num_history(e1267 pg_nums {6={447=1},7={710=8},8={715=8},9={720=32},10={725=8},11={730=8},12={735=64},13={741=128},14={1134=32},15={1141=128},16={1156=1}} deleted_pools 1017,5,1019,4,1022,13,1024,12,1026,11,1028,8,1030,10,1032,9,1034,7,1133,6)
Importing pgid 16.0s2
Write 2#16:0b2e8e02:::100000003ea.00000001:head#
snapset 1=[]:{}
...
Write 2#16:ffb5c1f5:::100000003ea.00000013:head#
snapset 1=[]:{}
write_pg epoch 1222 info 16.0s2( v 1170'26 (0'0,1170'26] local-lis/les=1221/1222 n=25 ec=1156/1156 lis/c=1221/1217 les/c/f=1222/1218/0 sis=1221)
Import successful

4. Unmask and start the target OSD:

root@vl-srv2 ~ $ systemctl unmask ceph-osd@19.service
root@vl-srv2 ~ $ systemctl start ceph-osd@19.service

5. Back on the management node, you can check the PG status with:

root@vl-srv2 ~ $ ceph pg 16.0 query
...
    "recovery_state": [
        {
            "name": "Started/Primary/Peering/Down",
            "enter_time": "2021-10-04T09:40:40.344462+0000",
            "comment": "not enough up instances of this PG to go active"
...

If the PG is not recovering, you may need to mark the non-existent OSDs as lost with ceph osd lost 4 --yes-i-really-mean-it.

6. Once the recovery process is complete, you will see the PG with the new acting set, active and clean again:

root@vl-srv2 ~ $ ceph pg 16.0 query
{
    "snap_trimq": "[]",
    "snap_trimq_len": 0,
    "state": "active+clean",
    "epoch": 288,
    "up": [
        19,
        7,
        5,
        3
    ],
    "acting": [
        19,
        7,
        5,
        3
    ],
...
    "recovery_state": [
        {
            "name": "Started/Primary/Active",
...

Conclusion

Exporting and importing inactive PGs using the ceph-objectstore-tool is a critical disaster recovery process. By following these steps, you can restore inactive PGs, ensure data accessibility, and bring your storage system back to a functional state.

Understanding the types of PG replications and the requirements for restoring data is crucial for effective Ceph cluster management. Whether you are dealing with replicated or erasure-coded pools, knowing how many replicas or chunks are needed to restore data can help you make informed decisions and maintain high availability in your cluster.